
A Handbook of Statistical Analyses Using Stata

Fourth Edition

Sophia Rabe-Hesketh Brian S. Everitt

Chapman & Hall/CRC Taylor & Francis Group

Boca Raton London New York

Chapman & Hall/CRC is an imprint of the Taylor & Francis Group, an informa business.

Chapman & Hall/CRC Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742

© 2007 by Taylor & Francis Group, LLC. Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works. Printed in the United States of America on acid-free paper. 10 9 8 7 6 5 4 3 2 1

International Standard Book Number-10: 1-58488-756-7 (Softcover) International Standard Book Number-13: 978-1-58488-756-0 (Softcover)

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC) 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Rabe-Hesketh, S. A handbook of statistical analyses using Stata / Sophia Rabe-Hesketh, Brian S. Everitt. -- 4th ed. p. cm.

Includes bibliographical references and index. ISBN 1-58488-756-7 (acid-free paper) 1. Stata. 2. Mathematical statistics--Data processing. I. Everitt, Brian. II. Title.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com

and the CRC Press Web site at http://www.crcpress.com

Dedication

To my parents, Birgit and Georg Rabe
Sophia Rabe-Hesketh

To my wife, Mary Elizabeth
Brian S. Everitt

Preface

Stata is an exciting statistical package that offers all standard and many non-standard methods of data analysis. In addition to general methods such as linear, logistic and Poisson regression, and generalized linear models, Stata provides many more specialized analyses, such as generalized estimating equations from biostatistics and the Heckman selection model from econometrics. Stata has extensive capabilities for the analysis of survival data, time series, panel (or longitudinal) data, and complex survey data. For all estimation problems, inferences can be made more robust to model misspecification using bootstrapping or robust standard errors based on the sandwich estimator. In each new release of Stata, its capabilities are significantly enhanced by a team of excellent statisticians and developers at StataCorp.

Although extremely powerful, Stata is easy to use, either by point-and-click or through its intuitive command syntax. Applied researchers, students, and methodologists therefore all find Stata a rewarding environment for manipulating data, carrying out statistical analyses, and producing publication quality graphics.

Stata also provides a powerful programming language making it easy to implement a 'tailor-made' analysis for a particular application or to write more general commands for use by the wider Stata community. In fact we consider Stata an ideal environment for developing and disseminating new methodology. First, the elegance and consistency of the programming language is appealing to methodologists. Second, it is simple to make new commands behave in every way like Stata's own commands, making them accessible to applied researchers and students. Third, Stata's email listserver Statalist, The Stata Journal, the Stata Users' Group Meetings, and the Statistical Software Components (SSC) archive on the internet all make exchange and discussion of new commands extremely easy. For these reasons Stata is constantly kept up-to-date with recent developments, not just by its own developers, but also by a very active Stata community.

This handbook follows the format of its two predecessors, A Handbook of Statistical Analyses Using S-PLUS and A Handbook of Statistical Analyses Using SAS. Each chapter deals with the analysis appropriate for a particular application. A brief account of the statistical background is included in each chapter including references to the literature, but the primary focus is on how to use Stata, and how to interpret results. Our hope is that this approach will provide a useful complement to the excellent but very extensive Stata manuals. The majority of the examples are drawn from areas in which the authors have most experience, but we hope that current and potential Stata users from outside these areas will have little trouble in identifying the relevance of the analyses described for their own data.

In the fourth edition, we have added many new exercises based on new datasets. For exercises marked with a symbol, answers are provided in the appendix. For the remaining exercises, a solutions manual is available from Chapman & Hall/CRC for course instructors.

Particular thanks are due to Nick Cox who provided us with extensive general comments for the second, third, and fourth editions of our book, and also gave us clear guidance as to how best to use a number of Stata commands. We are also grateful to Anders Skrondal for commenting on several drafts of the third edition. Various people at StataCorp have been very helpful in preparing the second, third, and fourth editions of this book. We would also like to acknowledge the usefulness of the Stata NetCourses in the preparation of the first edition of this book.

All the datasets can be downloaded from:

http://www.stata.com/texts/stas4

Individual datasets can also be read directly into Stata from the above site by specifying the full path. For example, to read the data wagepan.dta for Exercise 1.2, type the following command:

use http://www.stata.com/texts/stas4/wagepan

Contents

1 A Brief Introduction to Stata
  1.1 Getting help and information
  1.2 Running Stata
  1.3 Conventions used in this book
  1.4 Datasets in Stata
  1.5 Stata commands
  1.6 Data management
  1.7 Estimation
  1.8 Graphics
  1.9 Stata as a calculator
  1.10 Matrix calculations using Mata
  1.11 Brief introduction to programming
  1.12 Keeping Stata up to date
  1.13 Exercises

2 Data Description and Simple Inference: Female Psychiatric Patients
  2.1 Description of data
  2.2 Group comparison and correlations
  2.3 Analysis using Stata
  2.4 Exercises

3 Multiple Regression: Determinants of Pollution in U.S. Cities
  3.1 Description of data
  3.2 The multiple regression model
  3.3 Analysis using Stata
  3.4 Exercises

4 Analysis of Variance I: Treating Hypertension
  4.1 Description of data
  4.2 Analysis of variance model
  4.3 Analysis using Stata
  4.4 Exercises

5 Analysis of Variance II: Effectiveness of Slimming Clinics
  5.1 Description of data
  5.2 Analysis of variance model
  5.3 Analysis using Stata
  5.4 Exercises

6 Logistic Regression: Treatment of Lung Cancer and Diagnosis of Heart Attacks
  6.1 Description of data
  6.2 The logistic regression model
  6.3 Analysis using Stata
  6.4 Exercises

7 Generalized Linear Models: Australian School Children
  7.1 Description of data
  7.2 Generalized linear models
  7.3 Analysis using Stata
  7.4 Exercises

8 Summary Measure Analysis of Longitudinal Data: Treatment of Post-Natal Depression
  8.1 Description of data
  8.2 The analysis of longitudinal data
  8.3 Analysis using Stata
  8.4 Exercises

9 Random Effects Models: Thought Disorder and Schizophrenia
  9.1 Description of data
  9.2 Random effects models
  9.3 Analysis using Stata
  9.4 Thought disorder data
  9.5 Exercises

10 Generalized Estimating Equations: Epileptic Seizures and Chemotherapy
  10.1 Description of data

Chapter 1

A Brief Introduction to Stata

1.1 Getting help and information

Stata is a general purpose statistics package developed and maintained by StataCorp. There are several forms or "flavors" of Stata: the standard Intercooled Stata, the more limited Small Stata, Stata/SE (Special Edition) which can handle extremely large datasets, and Stata/MP (Multiple Processors) which runs in parallel on up to 32 processors. Each flavor exists for Windows (2000, XP, and later versions), Unix platforms, and the Macintosh. Almost all Stata features discussed in this book are common across platforms.

The base documentation set for Stata consists of eight manuals (StataCorp 2005a-h): Getting Started with Stata, Stata User's Guide, Base Reference Manual (three volumes), Data Management Reference Manual, Graphics Reference Manual, and Quick Reference and Index. In addition there are more specialized reference manuals such as the Stata Programming Reference Manual and the Stata Longitudinal/Panel Data Reference Manual. The reference manuals provide extremely detailed information on each command while the User's Guide describes Stata more generally. Features that are specific to the operating system are described in the appropriate Getting Started manual, e.g., Getting Started with Stata for Windows.

Each Stata command has associated with it a help file that may be viewed within a Stata session using the help facility. Both the help files and the manuals refer to the Base Reference Manual by [R] name of entry, to the User's Guide by [U] chapter or section number and name, the Graphics Manual by [G] name of entry, etc. (see the Stata Getting Started manual, immediately after the table of contents, for a complete list).

There are an increasing number of general introductory books on Stata, including the book you are reading now, Kohler and Kreuter (2005), and Acock (2006). In addition, there are books on Stata for particular types of analysis such as categorical data analysis (Long and Freese, 2006), survival analysis (Cleves, Gould and Gutierrez, 2004), generalized linear models (Hardin and Hilbe, 2006), and multilevel and longitudinal models (Rabe-Hesketh and Skrondal, 2005). The web site http://www.stata.com/bookstore/statabooks.html provides up-to-date information on these and other books.

The Stata web page at http://www.stata.com offers much useful information for learning Stata including an extensive series of "frequently asked questions" (FAQs). Stata also offers internet courses, called NetCourses. These courses take place via a temporary mailing list for course organizers and "attenders". Each week, the course organizers send out lecture notes and exercises which the attenders can discuss with each other until the organizers send out the answers to the exercises and to the questions raised by attenders.

The UCLA Academic Technology Services offer useful textbook and paper examples at http://www.ats.ucla.edu/stat/stata/, showing how analyses can be carried out using Stata. Also very helpful for learning Stata are the regular columns Speaking Stata and Stata Tips in The Stata Journal; see http://www.stata-journal.com. It is possible to purchase individual issues, or a compilation of Stata tips by Newton and Cox (2006).

One of the exciting aspects of being a Stata user is being part of a very active Stata community as reflected in the busy Statalist mailing list, Stata Users' Group meetings taking place every year in the UK, USA and various other countries, and the large number of user-contributed Stata programs; see also Section 1.12. Statalist also functions as a technical support service with Stata staff and expert users such as Nick Cox offering very helpful responses to questions.

1.2 Running Stata

This section gives an overview of what happens in a typical Stata session, referring to subsequent sections for more details. We are using the Windows version here and some features may be different in Stata for other platforms. We therefore recommend consulting the Getting Started with Stata manual for your platform.

1.2.1 Stata windows

When Stata is started, a screen opens as shown in Figure 1.1 containing four windows labeled:

• Command: here commands are issued interactively
• Results: here results are displayed
• Review: here all commands issued within the current Stata session are shown
• Variables: here the variables of the current dataset are listed

Figure 1.1: Stata windows.

Each of the Stata windows can be resized and moved around in the usual way; the Command, Review, and Variables windows can also be moved outside the main window (undocked) in which case they will not move along with the main Stata window. To bring an undocked window forward that may be obscured by other windows, make the appropriate selection in the Window menu. To dock a window, drag it back into the main window. A transparent blue box appears in place of the window being dragged and docking guides appear at the center and edges of the main window. Release the mouse button when the transparent blue box is on the appropriate docking guide, for instance on the arrow pointing down, to dock the window at the bottom of the main Stata window.

The fonts in a window can be changed by clicking the right mouse button over the window. All these settings are automatically saved when Stata is closed. Use the Manage Preferences selection from the Prefs menu to save and load specific settings, for instance a large font setting for teaching, or to reload the factory (or default) settings.

Three other types of windows can be created within a Stata session: Viewer windows to view help or log files, Graph windows to display graphs, and Do-file Editors to build and run scripts (called do-files).

Stata datasets have the .dta extension and can be loaded into Stata in the usual way through the File menu (for reading other data formats, see Section 1.4.1). As in other statistical packages, a dataset is a matrix where the columns represent variables (with names and labels) and the rows represent observations. When a dataset is open, the variable names and variable labels appear in the Variables window. The dataset may be viewed as a spreadsheet by opening the Data Browser with its toolbar button and edited by clicking the Data Editor button to open the Data Editor. Both the Data Browser and the Data Editor can also be opened through the Window menu. Note, however, that nothing else can be done in Stata while the Data Browser or Data Editor is open (e.g., the Command window disappears). See Section 1.4 for more information on datasets.

1.2.3 Commands and output

Until release 8.0, Stata was entirely command-driven and many users still prefer using commands as follows: a command is typed in the Command window and executed by pressing the Return (or Enter) key. The command then appears next to a full stop (period) in the Stata Results window, followed by the output.

If the output produced is longer than the Results window, --more-- appears at the bottom of the screen. Pressing any key scrolls the output forward one screen. The scroll-bar may be used to move up and down previously displayed output. However, only a certain amount of past output is retained in the Results window. For this reason and to save output for later, it is useful to open a log file; see Section 1.2.6. It is possible to copy and print selected output from the Results window. Edit → Copy Table can be used to copy and paste tables so that columns are separated by tabs, making it easy to produce a nicely formatted table, for instance in Word.

Stata is ready to accept a new command when the prompt (a period) appears at the bottom of the screen. If Stata is not ready to receive new commands because it is still running or has not yet displayed all the current output, it may be interrupted by holding down Ctrl and pressing the Pause/Break key or by pressing the red Break button.

A previous command can be accessed using the PgUp and PgDn keys or by selecting it from the Review window where all commands from the current Stata session are listed (see Figure 1.1). The command may then be edited if required before pressing Return to execute the command.

Most Stata commands refer to a list of variables, the basic syntax being command varlist. For example, if the dataset contains variables x, y, and z, then

list x y

lists the values of x and y. Other components may be added to the command; for example, adding if exp after varlist causes the command to process only those observations satisfying the logical expression exp. Options are separated from the main command by a comma. The complete command structure and its components are described in Section 1.5.
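The command structure just described can be tried on the auto.dta example data supplied with Stata (see Section 1.4.1). The following is only a minimal sketch: the variables make and price and the threshold of 10000 are chosen for illustration and are not taken from this book's datasets.

```stata
* Load the example data supplied with Stata, clearing any data in memory
sysuse auto, clear

* command varlist: list two variables for all observations
list make price

* command varlist if exp: only observations satisfying the expression
list make price if price > 10000

* command varlist if exp, options: options follow a comma
list make price if price > 10000, clean noobs
```

The three list commands differ only in the components added after the variable list, which is the pattern most Stata commands follow.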

1.2.4 GUI versus commands

Since release 8.0, Stata has a Graphical User Interface (GUI) that allows all non-programming commands to be accessed via point-and-click. Simply start by clicking into the Data, Graphics, or Statistics menus, make the relevant selections, fill in a dialog box, and click OK. Stata then behaves exactly as if the corresponding command had been typed with the command appearing in the Results and Review windows and being accessible via PgUp and PgDn.

A great advantage of the menu system is that it is intuitive so that a complete novice to Stata could learn to run a linear regression in a few minutes. A disadvantage is that pointing and clicking can be time-consuming and cannot be automated. Commands, on the other hand, can be saved in a file (called a do-file in Stata) and run again at a later time. In our opinion, the menu system is a great device for finding out which command is needed and learning how it works, but serious statistical analysis is best undertaken using commands. In this book we therefore say very little about the menus and dialogs (they are largely self-explanatory after all), but see Section 1.8 for an example of creating a graph through the dialogs.

It is easy to move back and forth between a dialog box and help for the corresponding command: to move to help, click the button at the bottom-left of the dialog box; to move to the dialog box, click on the link at the top-right of the help viewer.

It is useful to build up a file containing the commands necessary to carry out a particular data analysis. This may be done using Stata's Do-file Editor or any other editor. The Do-file Editor may be opened by clicking its toolbar button. Commands can then be typed in and run as a batch either by clicking the appropriate button in the Do-file Editor or by using the do command followed by the name of the do-file.

Alternatively, a subset of commands can be highlighted and executed by clicking the corresponding button. The do-file can be saved for use in a future Stata session. To open a do-file, select Do... from the File menu or open the Do-file Editor and use its File menu. See Section 1.11 for more information on do-files.

It is useful to open a log file at the beginning of a Stata session. Press the Log button, type a filename into the dialog box, and choose Save. By default, this produces a SMCL (Stata Markup and Control Language, pronounced "smicle") file with extension .smcl, but a plain text (ASCII) file can be produced by selecting the .log extension. If the file already exists, another dialog opens to allow you to decide whether to overwrite the file with new output or to append new output to the existing file.

The log file can be viewed in the Stata Viewer during the Stata session (again through the Log button). For long log files, it can be useful to click the search button in the Stata Viewer to search for a piece of text. The log file is automatically saved when it is closed. Log files can also be opened, viewed, and closed by selecting Log from the File menu, followed by Begin..., View..., or Close, respectively. The following commands can be used to open and close a log file mylog, replacing the old one if it already exists:

log using mylog, replace
log close

To view a log file produced in a previous Stata session, select File → Log → View... and specify the full path of the log file. The log may then be printed by selecting Print → Viewer from the File menu.

It is also possible to translate SMCL files to plain text files and vice versa using the File menu or the translate command. To save the output in the Results window as a plain text log file, type

translate @Results mylog.txt
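As a rough illustration of the log-file workflow described above, the following sketch opens a log, runs a command, closes the log, and converts it to plain text. It assumes that no log is already open and that the current directory is writable; the file name mylog follows the text, while the use of auto.dta and summarize is purely illustrative.

```stata
* Open a SMCL log (the default format), replacing any existing mylog.smcl
log using mylog, replace

* Anything run between log using and log close is recorded
sysuse auto, clear
summarize price

* Close the log; the file is saved at this point
log close

* Convert the SMCL log to a plain text file
translate mylog.smcl mylog.log, replace
```

Placing log using at the top of a do-file and log close at the bottom is a common way to keep a complete record of an analysis.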

1.2.7 Getting help

Help may be obtained by clicking on Help which brings up the menu shown in Figure 1.2. To get help on a Stata command, assuming the command name is known, select Stata Command.... To find the appropriate Stata command first, select Search... which opens up the dialog in Figure 1.3. For example, to find out how to fit a Cox regression model, type "survival" or "cox" under Keywords: and press OK. This opens a Stata Viewer containing a list of relevant command names with a brief description. In this case stcox is the command we need. Also listed are topics for which Frequently Asked Questions (FAQs) or examples are available on the web, and user-contributed commands published in the Stata Journal (abbreviated SJ) or its predecessor, the Stata Technical Bulletin (abbreviated STB). Each entry in this list includes a blue keyword (a hyperlink) that may be selected to view the appropriate help file or web site. Each help file contains hyperlinks to other relevant help files.

Figure 1.2: Menu for help.

Figure 1.3: Dialog for search.

The search can also be performed via the command

search survival

and help on the command stcox can be found using

help stcox

Help will then appear in the Stata Results window (instead of the Stata Viewer) where words displayed in blue also represent hyperlinks to other files.

If the computer running Stata is connected to the internet, you can also search through "unofficial" materials on the internet, to find, for instance, user-contributed programs not published in the Stata Journal or Stata Technical Bulletin (see Section 1.12 for more information). This is accomplished by selecting "Search net resources" or "Search all" in the search dialog box. The latter is equivalent to using the findit keyword command. More refined searches can be carried out using the search command (see help search). The other selections in the help dialog, News, Official Updates, SJ and User-written Programs, and Stata Web Site, all provide access to the relevant web sites.

Stata can be closed in three ways:

• click on the Close button at the top right-hand corner of the Stata screen
• select Exit from the File menu
• type exit, clear in the Stata Command window, and press Return.

1.3 Conventions used in this book

In this book we will use typewriter font like this for anything that could be typed into the Stata Command window or a do-file, that is, command names, options, variable names, etc. In contrast, italicized words are not supposed to be typed; they should be substituted by another word. For example, summarize varname means that varname should be substituted by a specific variable name, such as age, giving summarize age. We will usually display sequences of commands as follows:

summarize age
drop age

If a command continues over two lines, we use /// at the end of the first line to make Stata ignore the line break. An alternative is to use /* at the end of the first line and */ at the beginning of the second line to "comment out" the line break. Note that these methods are for use in a do-file and do not work in the Command window where they would result in an error. In the Command window, commands can wrap over several lines.
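The two ways of continuing a command over a line break can be sketched as follows. This is meant to be run as a do-file, not typed into the Command window; the auto.dta example data and the variables used are assumptions made for illustration.

```stata
* Example data supplied with Stata (illustrative choice)
sysuse auto, clear

* One command split over two lines using ///
summarize price mpg ///
    weight length

* The same command, with the line break commented out by /* and */
summarize price mpg /*
    */ weight length
```

Both versions are read by Stata as the single command summarize price mpg weight length.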

Output taking little space is displayed immediately following the commands but without indentation and in a smaller font:

display 1

1

Output taking up more space is shown in a numbered display floating in the text. Some commands produce little notes, for example, the generate command prints out how many missing values are generated. We will usually not show such notes.

1.4 Datasets in Stata

1.4.1 Data input and output

Stata has its own data format with default extension .dta. Reading and saving a Stata file are straightforward. If the filename is wagepan.dta, the commands are

use wagepan
save wagepan

If the data are not stored in the current directory, then the complete path must be specified, as in the command

use c:\user\data\wagepan

(If the path contains spaces, it must be enclosed in quotes, e.g., "c:\my own data\wagepan".) However, the least error-prone way of keeping all the files for a particular project in one directory is to change to that directory and refer to all files without specifying the path:

cd c:\user\data
use wagepan
save wagepan

Note that the datasets of this book can also be read from a web site by specifying the path http://www.stata.com/texts/stas4, e.g.:

use http://www.stata.com/texts/stas4/wagepan

Data supplied with Stata can be read in using the sysuse command. For instance, the famous auto.dta data, which are often used in the Stata manuals, can be read using

sysuse auto

Before reading a file into Stata, all data already in memory need to be cleared, either by running clear before the use command or by using the option clear as follows:

use wagepan, clear

If we wish to save data under an existing filename, this results in an error message unless we use the option replace as follows:

save wagepan, replace
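The roles of the clear and replace options can be illustrated with a short do-file. This is a sketch under stated assumptions: it uses the auto.dta example data and a temporary file (created with tempfile, which is not discussed in this section) so that no permanent file is overwritten, and it must be run as a do-file because the temporary file name is held in a local macro.

```stata
* Reserve a temporary file name; it is held in the local macro mydata
tempfile mydata

* Read example data, clearing whatever is in memory
sysuse auto, clear

* First save succeeds because the file does not exist yet
save "`mydata'"

* Saving again without replace fails; capture traps the error
capture save "`mydata'"
display "return code without replace: " _rc

* With replace the existing file is overwritten
save "`mydata'", replace

* Read the saved file back in, clearing the data in memory
use "`mydata'", clear
```

A nonzero return code after the second save indicates that Stata refused to overwrite the existing file.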

For large datasets it is sometimes necessary to increase the amount of memory Stata allocates to its data areas. For example, when no dataset is loaded (e.g., after issuing the command clear), set the memory to 2 megabytes using

set memory 2m

The memory command without arguments gives information on how much memory is being used and how much is available.

If the data are not available in Stata format, they may be converted to Stata format using another package (e.g., Stat/Transfer) or saved as an ASCII file (although the latter option means losing all the labels). When saving data as ASCII, missing values should be replaced by some numerical code.

There are three commands available for reading different types of ASCII data: insheet is for files containing one observation (on all variables) per line with variables separated by tabs or commas, where the first line may contain the variable names; infile with varlist (free format) allows line breaks to occur anywhere and variables to be separated by spaces, commas, or tabs; infix is for files with fixed column format but a single observation can go over several lines; infile with a dictionary (fixed format) is the most flexible command since the dictionary can specify exactly what lines and columns contain what information.

Data can be saved as ASCII using outfile or outsheet. Finally, the odbc command can be used to load, write, or view data from Open Data Base Connectivity (ODBC) sources. See help infiling or [U] 21 Inputting data for an overview of commands for reading data.
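A round trip through a comma-separated ASCII file gives a feel for outsheet and insheet. The sketch below assumes a writable current directory; the file name auto_subset.csv and the choice of variables from auto.dta are hypothetical and used only for illustration.

```stata
* Start from example data and keep a few variables
sysuse auto, clear
keep make price mpg

* Write a comma-separated file; variable names go on the first line
outsheet using auto_subset.csv, comma replace

* Read the file back in, taking variable names from the first line
insheet using auto_subset.csv, comma clear

* Check what was read
describe
```

Note that variable labels and formats are not stored in the ASCII file, which is the loss of labels mentioned above.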

Only one dataset may be loaded at any given time but a dataset may be combined with the currently loaded dataset using the command append or merge to add observations or variables, respectively; see also Section 1.6.2.
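Adding observations from a second dataset can be sketched with append. The example below is an illustration only: it splits the auto.dta example data into two parts and puts them back together, using a temporary file, so it should be run as a do-file. Adding variables with merge additionally requires an identifier common to both datasets and is not shown here.

```stata
sysuse auto, clear
tempfile foreigncars

* Save the foreign cars in a separate (temporary) dataset
preserve
keep if foreign == 1
save "`foreigncars'"
restore

* Keep only domestic cars in memory
keep if foreign == 0

* Add the observations of the saved dataset to the data in memory
append using "`foreigncars'"

* All observations are now back in one dataset
count
```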

There are essentially two kinds of variables in Stata: string and numeric. Each variable can be one of a number of storage types that require different numbers of bytes. The storage types are byte, int, long, float, and double for numeric variables and str1 to str244 for string variables of different lengths. Besides the storage type, variables have associated with them a name, a label, and a format. The name of a variable y can be changed to x using

rename y x

The variable label can be defined using

label variable x "cost in pounds"

and the format of a numeric variable can be set to "general numeric" with two decimal places using

format x %7.2g
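As a minimal sketch of these three commands working together, assuming a small made-up dataset (the variable y and its values are invented purely for illustration), a do-file could look like this:

```stata
* Hypothetical data: the variable y and its values are invented for illustration
clear
input y
12.5
7.25
103
end

rename y x                          // change the variable name from y to x
label variable x "cost in pounds"   // attach a descriptive variable label
format x %7.2g                      // general numeric display format
describe x                          // shows storage type, format, and label
```

The describe command is a convenient way of checking that the name, label, and format have been changed as intended.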


A missing value in a numeric variable is represented by a period "." (system missing values), or by a period followed by a letter, such as .a, .b, etc., codes that can be used for distinguishing between different kinds of missing values. Missing values are interpreted as very large positive numbers with . < .a < .b, etc. Note that this can lead to mistakes in logical expressions; see also Section 1.5.2. Numerical missing value codes (such as "-99") may be converted to missing values (and vice versa) using the command mvdecode. For example,

mvdecode x, mv(-99)

replaces all values of variable x equal to -99 by periods and

mvencode x, mv(-99)

changes the missing values back to -99.

Numeric variables can be used to represent categorical or continuous variables including dates. For categorical variables it is not always easy to remember which numerical code represents which category. Value labels can therefore be defined as follows:

label define s 1 married 2 divorced 3 widowed 4 single
label values marital s

The categories can also be recoded. For example, the command

recode marital 2/3=2 4=3

merges categories 2 and 3 into category 2 and changes category 4 to 3.

Dates are defined as the number of days since 1/1/1960 and can be displayed using a date format such as %d. For example, listing the variable time in %7.0g format using

list time

shows the raw day counts, whereas

format time %d
list time

displays the same values as dates. See help dfmt for other date formats.
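The effect of a date format can be sketched with two dates chosen only for illustration. The function mdy() returns the number of days since 1/1/1960, so that 1 January 2001 is stored as 14976 and 30 January 1930 as -10928:

```stata
* Hypothetical example: the two dates below are chosen only for illustration
clear
set obs 2
generate time = mdy(1, 1, 2001) in 1     // days since 1/1/1960 for 1 Jan 2001
replace  time = mdy(1, 30, 1930) in 2    // a date before 1960 is negative
list time                                // raw day counts
format time %d
list time                                // the same values shown as dates
```

Changing the format only affects how the values are displayed; the stored numbers are the same in both listings.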

String variables

String variables are typically used for categorical variables or identifiers and in some cases for dates (e.g., if the file was saved as an ASCII file from SPSS or Excel). In Stata, it is generally advisable to represent these variables by numeric variables, and conversion from string to numeric is straightforward. A categorical string variable (or identifier) can be converted to a numeric variable using the command encode which replaces each unique string by a different integer and uses that string as the label for the corresponding integer value. The command decode converts the labeled numeric variable back to a string variable. The command destring can be used to convert several (or all) variables from string to numeric by interpreting the strings as numbers. This is useful if numeric variables saved from another program are interpreted by Stata as string variables, for instance due to missing values being represented by blanks.

A string variable string1 representing dates can be converted to numeric using the function date(string1, string2) where string2 is a permutation of "dmy" to specify the order of the day, month, and year in string1. For example, the commands

display date("30/1/1930", "dmy")

and

display date("january 30, 1930", "mdy")

both return the negative value -10928 because the date is 10928 days before 1/1/1960.
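The conversions with encode, decode, and destring described above can be sketched as follows. The dataset is hypothetical: the variable names status and score, the new names status_num and status2, and all values are invented for illustration.

```stata
* Hypothetical data: variable names and values are invented for illustration
clear
input str8 status str4 score
"married" "12"
"single"  ""
"married" "7"
end

encode status, generate(status_num)   // integer codes, strings kept as value labels
decode status_num, generate(status2)  // back to a string variable
destring score, replace               // "12" -> 12; the blank becomes missing
list
```

Here encode numbers the categories in alphabetical order, so that "married" becomes 1 and "single" becomes 2, while the blank score is turned into a system missing value by destring.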

1.5 Stata commands

Typing help language gives the following generic command structure for most Stata commands:

[prefix:] command [varlist] [=exp] [if] [in] [weight] [using filename] [, options]


The help file contains links to information on each of the components, and we will briefly describe them here:

[prefix:] could be a number of different things; see help prefix. One example is by varlist: which instructs Stata to repeat the command for each combination of values in the list of variables varlist.

command is the name of the command and can often be abbreviated; for example, the command display can be abbreviated as dis.

[varlist] is the list of variables to which the command applies.

[=exp] is an expression.

[if] if exp restricts the command to that subset of the observations that satisfies the logical expression exp.

[in] in range restricts the command to those observations whose indices lie in a particular range range.

[weight] allows weights to be associated with observations (see Section 1.7).

[using filename] specifies the filename to be used.

[, options] a comma is only needed if options are used; options are specific to the command and can often be abbreviated.

For any given command, some of these components may not be available; for example, list does not allow [using filename]. The help file for a specific command specifies which components are available, using the same notation as above, with square brackets enclosing components that are optional. For example, help list gives

list [varlist] [if] [in] [, options]

implying that [prefix:] and various other components are not allowed and that all permissible components are optional.

The syntax for varlist, exp, and range is described in the next three subsections, followed by information on how to loop through sets of variables or observations.

1.5.1 Varlist

The simplest form of varlist is a list of variable names separated by spaces. Variable names may also be abbreviated as long as this is unambiguous, e.g., x1 may be referred to by x only if there is no other variable name starting with x such as x itself or x2. A set of adjacent variables such as m1, m2, and x may be referred to as m1-x. All variables starting with the same set of letters can be represented by that set of letters followed by a wild card *, so that m* may stand for m1 m6 mother. The wild card can appear anywhere within the variable name; for instance my*var could refer to mylongvar and myshortvar. The set of all variables is referred to by _all or *. Examples of a varlist are

x y
x1-x16
a1-a3 my* sex age

Note that Stata is case sensitive. Long variable names are abbreviated in Stata's output by replacing a middle section of the name by a ~. This method of abbreviating variables can also be used in a varlist as long as the abbreviation is unambiguous. For instance, my~var would not work if there are variables mylongvar and myshortvar in the dataset, whereas myl~var would work.

A useful command for finding variables in a large dataset is lookfor which searches for a keyword among the variable names and labels.

1.5.2 Expressions

There are logical, algebraic, and string expressions in Stata. Logical expressions evaluate to 1 (true) or 0 (false) and use the operators < and <= for "less than" and "less than or equal to", respectively. Similarly, > and >= are used for "greater than" and "greater than or equal to". The symbols == and != (or ~=) stand for "equal to" and "not equal to", and the characters ! (or ~), &, and | represent "not", "and", and "or", respectively, so that

if (y!=2 & z>x) | x==1

means "if y is not equal to 2 and z is greater than x or if x equals 1". In fact, expressions involving variables are evaluated for each observation so that the expression really means

(y_i != 2 & z_i > x_i) | x_i == 1

where i is the observation index.

Great care must be taken in using the > or >= operators when there are missing data. For example, if we wish to delete all subjects older than 16, the command

drop if age>16

will also delete all subjects for whom age is missing since a missing value (represented by ".", ".a", ".b", etc.) is interpreted as a very large number. It is always safer to allow for missing values explicitly using for instance

drop if age>16 & age<.

Note that this is safer than specifying age!=. which would not exclude missing values coded as ".a", ".b", etc.
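The pitfall can be checked on a small hypothetical dataset (the ages below are invented; the last two are the missing values . and .a). Using count instead of drop shows how many observations each condition would select without changing the data:

```stata
* Hypothetical data: ages are invented; the last two are missing (. and .a)
clear
input age
12
17
.
.a
end

count if age > 16             // counts 3: the two missing values are included
count if age > 16 & age < .   // counts 1: only the genuine age of 17
count if age > 16 & age != .  // counts 2: .a is still included
```

Only the second condition restricts the command to subjects who are known to be older than 16.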

Algebraic expressions use the usual operators +, -, *, /, and ^ for addition, subtraction, multiplication, division, and powering, respectively. Stata also has many mathematical functions such as sqrt(), exp(), log(), etc. and statistical functions such as chi2tail() and normal() for cumulative distribution functions and invnormal(), etc., for inverse cumulative distribution functions. Pseudo-random numbers with a uniform distribution on the [0,1) interval may be generated using uniform(). An algebraic expression may therefore contain, for example, invnormal(uniform()), which returns a (different) draw from the standard normal distribution for each observation.

Finally, string expressions mainly use special string functions such as substr(str, n1, n2) to extract a substring from str starting at n1 for a length of n2. The logical operators == and ~= are also allowed with string variables and the operator + concatenates two strings. For example, the combined logical and string expression

("moon" + substr("sunlight",4,5)) == "moonlight"

returns the value 1 for "true". For a list and explanation of all functions, use help functions.
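The three kinds of expression can be tried out directly with display. The following lines are only a sketch; the numbers are arbitrary and no dataset is needed:

```stata
display (2 + 3)^2 / 5                              // algebraic: 25/5
display sqrt(16) + exp(0)                          // functions: 4 + 1
display "moon" + substr("sunlight", 4, 5)          // string concatenation
display ("moon" + substr("sunlight", 4, 5)) == "moonlight"   // logical: 1 if true
```

Because logical expressions evaluate to 0 or 1, they can also be used inside algebraic expressions, for instance to construct indicator variables.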

1.5.3 Observation indices and ranges

Each observation has an index associated with it. For example, the value of the third observation on a particular variable x may be referred to as x[3]. The macro _n takes on the value of the running index and _N is equal to the number of observations. We can therefore refer to the previous observation of a variable as x[_n-1].

An indexed variable is only allowed on the right-hand side of an assignment. If we wish to replace x[3] by 2, we can do this using the syntax

replace x = 2 if _n==3

We can refer to a range of observations either using if with a logical expression involving _n or, more easily, by using in range. The command above can then be replaced by

replace x = 2 in 3

More generally, range can be a range of indices specified using the syntax f/l (for "first to last") where f and/or l may be replaced by numerical values if required, so that 5/12 means "fifth to twelfth" and f/10 means "first to tenth", etc. Negative numbers are used to count from the end, for example

list x in -10/l

lists the last 10 observations.
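A short sketch on hypothetical data (x simply takes the values 10, 20, ..., 50; the variable names prev and last are invented) shows _n, _N, indexed variables, and in range working together:

```stata
* Hypothetical data: x takes the values 10, 20, ..., 50
clear
set obs 5
generate x = 10*_n              // _n is the running observation index
generate prev = x[_n-1]         // previous observation; missing in the first row
generate last = x[_N]           // _N is the number of observations
replace x = 2 in 3              // same effect as: replace x = 2 if _n==3
list x prev last in -2/l        // last two observations
```

Note that prev is missing for the first observation because there is no observation with index 0.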

1.5.4 Looping through variables or observations

Explicitly looping through observations is often not necessary because expressions involving variables are automatically evaluated for each observation. It may however be required to repeat a command for subsets of observations and this is what the prefix by varlist: is for. Before using by varlist:, however, the data must be sorted using

sort varlist

where varlist includes the variables to be used for by varlist:. Note that if varlist contains more than one variable, ties in the earlier variables are sorted according to the next variable(s). For example,

sort school class
by school class: summarize test

give the summary statistics of test for each class. If class is labeled from 1 to n_i for the ith school, then not using school in the above commands would result in the observations for all classes with the same label being grouped together. To avoid having to sort the data first, the sort option can be used as follows:

by school class, sort: summarize test

A very useful feature of by varlist: is that it causes the observation index _n to count from 1 within each of the groups defined by the distinct combinations of the values of varlist. The macro _N represents the number of observations in each group. For example,

sort group age
by group: list age if _n==_N

lists age for the last observation in each group where the last observation in this case is the observation with the highest age within its group. The same can be achieved in a single command using the sort option:

by group (age), sort: list age if _n==_N

where the variable in parentheses is used to sort the data but does not contribute to the definition of the subgroups of observations to which the list command applies.
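The behavior of _n and _N under by varlist: can be sketched with a small hypothetical dataset (two groups with invented ages; the variables rank and size are introduced only for this illustration):

```stata
* Hypothetical data: two groups with invented ages
clear
input group age
1 30
1 25
2 41
2 52
2 47
end

by group (age), sort: generate rank = _n     // _n restarts at 1 in each group
by group (age): generate size = _N           // _N is the group size
by group (age): list age if _n == _N         // oldest member of each group
```

After the first command the data are sorted by group and age, so the later by group (age): prefixes can be used without the sort option.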

We can loop through a list of variables or other objects using foreach. The simplest syntax is

foreach variable in v1 v2 v3 {
    list `variable'
}

This syntax uses a local macro (see also Section 1.9) variable which takes on the (string) values v1, then v2, and finally v3 inside the loop. (Local macros can also be defined explicitly using local variable v1.) Enclosing the local macro name in ` ' is equivalent to typing its contents, i.e., `variable' evaluates to v1, then v2, and finally v3 so that each of these variables is listed in turn.

In the first line above we listed each variable explicitly. We can instead use the more general varlist syntax by specifying that the list is of type varlist as follows:

foreach variable of varlist v* {
    list `variable'
}

Numeric lists can also be specified using foreach. For instance, the command

foreach number of numlist 1 2 3 {
    display `number'
}

produces the output

1
2
3


Numeric lists may be abbreviated by "first/last", here 1/3 or "first(increment)last", for instance 1(2)7 for the list 1 3 5 7. See help foreach for other list types.

For numeric lists, a simpler syntax is forvalues. To produce the output above, use

forvalues i=1/3 {
    display `i'
}

The same output can also be produced using while as follows:

local i = 1
while `i'<=3 {
    disp `i'
    local i = `i' + 1
}

Here the local macro i was defined using local i = 1 and then incremented by 1 using local i = `i' + 1. See also Section 1.11 on programming. Cox (2002b) gives a useful tutorial on by varlist: and Cox (2002a; 2000) discusses foreach and forvalues in detail.
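The loops above only list or display values. As a slightly more practical sketch, assuming three invented variables v1 to v3, the following do-file summarizes each variable in turn and then accumulates a sum in a local macro (called total here purely for illustration):

```stata
* Hypothetical data: v1-v3 are invented variables
clear
set obs 3
generate v1 = _n
generate v2 = 2*_n
generate v3 = 3*_n

* Loop over a varlist and display the mean of each variable
foreach variable of varlist v* {
    quietly summarize `variable'
    display "`variable': mean = " r(mean)
}

* Accumulate 1 + 2 + 3 with forvalues
local total = 0
forvalues i = 1/3 {
    local total = `total' + `i'
}
display `total'
```

Since local macros only exist within the do-file or program that defines them, the loop and the final display command must be run together.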

1.6 Data management

1.6.1 Generating and changing variables

New variables may be generated using the commands generate or egen. The command generate creates a new variable, evaluates an expression for each observation, and places the result into the new variable. For example,

generate x = 1

creates a new variable called x and sets it equal to one. When generate is used together with if exp or in range, the remaining observations are set to missing. For example,

generate percent = 100*(old - new)/old if old>0

generates the variable percent and sets it equal to the percentage decrease from old to new where old is positive and equal to missing otherwise. The command replace works in the same way as generate except that it allows an existing variable to be changed. For example,

replace percent = 0 if old<=0


changes the missing values in the variable percent to zeros. The two commands above could be replaced by the single command

generate percent = cond(old>0, 100*(old-new)/old, 0)

where cond() evaluates to the second argument if the first argument is true and to the third argument otherwise.
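That the two approaches give the same result can be checked on a few hypothetical values of old and new (the data and the name percent2 are invented for this sketch):

```stata
* Hypothetical data: old and new are invented measurements
clear
input old new
200 150
0   10
50  60
end

generate percent = 100*(old - new)/old if old > 0   // missing where old <= 0
replace  percent = 0 if old <= 0
generate percent2 = cond(old > 0, 100*(old - new)/old, 0)
assert percent == percent2      // the two approaches agree here
```

The assert command stops with an error if the condition is false for any observation, which makes it a useful check after generating or changing variables.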

The command egen provides extensions to generate. One advantage of egen is that some of its functions accept a variable list as an argument, whereas the functions for generate can only take simple expressions as arguments. For example, we can form the average of 100 variables m1 to m100 using

egen average = rowmean(m1-m100)

where missing values are ignored. Other functions for egen operate on groups of observations. For example, if we have the income (variable income) for members within families (variable family), we may want to compute the total income of each member's family using the egen function total() applied separately for each family.

An existing variable can be replaced using egen functions only by first deleting it using

drop x

Another way of dropping variables is using keep varlist where varlist is the list of all variables not to be dropped.

A very useful command for changing categorical numeric variables is recode. For instance, to merge the first three categories and recode the fourth to "2", type

recode categ 1/3 = 1 4 = 2

If there are any other values, such as missing values, these will remain unchanged. See help recode for more information.
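The group-wise use of egen and the recode command can be sketched on a small hypothetical dataset. The variable name faminc and all values are invented for illustration, and the by family, sort: prefix is just one way of applying total() within families:

```stata
* Hypothetical data: family, income, and categ are invented for illustration
clear
input family income categ
1 100 1
1 250 3
2 300 4
2 .   2
end

by family, sort: egen faminc = total(income)   // family total; missing income ignored
recode categ 1/3 = 1 4 = 2                     // merge categories 1-3, recode 4 to 2
list
```

Every member of a family receives the same value of faminc, and the missing income in the second family is treated as zero when forming the total.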

1.6.2 Changing the shape of the data

It is frequently necessary to change the shape of data, the most common application being grouped data, in particular repeated measures or panel data. If we have measurement occasions j for subjects i, this may be viewed as a multivariate dataset in which each occasion j is represented by a variable xj, and the subject identifier is in the variable subj. However, for some statistical analyses we may need one single, long, response vector containing the responses for all occasions for all subjects, as well as two variables subj and occ to represent the indices i and j, respectively. The two "data shapes" are called wide and long, respectively. We can convert from the wide shape with variables xj and subj given by

list

to the long shape with variables x, occ, and subj using the syntax

reshape long x, i(subj) j(occ)

(note: j = 1 2)

Data                               wide   ->   long
Number of obs.                        2   ->      4
Number of variables                   3   ->      3
j variable (2 values)                     ->   occ
xij variables:
                                  x1 x2   ->   x

The data now look like this:

list

We can change the data back again using

reshape wide x, i(subj) j(occ)

For data in the long shape, it may be required to collapse the data so that each group is represented by a single summary measure. For example, for the data above, we may want to create new variables meanx, sdx, and num containing the mean, standard deviation, and the number of nonmissing responses, respectively. This can be achieved using

collapse (mean) meanx=x (sd) sdx=x (count) num=x, by(subj)
list

Since it is not possible to convert back to the original format in this case, the data may be preserved before running collapse and restored again later using the commands preserve and restore.
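The whole sequence of reshaping, collapsing, and restoring can be sketched as follows, assuming a hypothetical wide dataset with two subjects and two occasions (the responses are invented):

```stata
* Hypothetical wide data: two subjects, two occasions, invented responses
clear
input subj x1 x2
1 5 7
2 6 10
end

reshape long x, i(subj) j(occ)      // 4 rows: one per subject and occasion
preserve                            // keep a copy of the long data
collapse (mean) meanx=x (sd) sdx=x (count) num=x, by(subj)
list                                // one summary row per subject
restore                             // back to the long data
reshape wide x, i(subj) j(occ)      // and back to the original wide shape
```

Placing collapse between preserve and restore means that the summary table can be inspected without losing the original observations.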

Other ways of changing the shape of data include dropping observations using

drop in 1/10

to drop the first 10 observations or

by group (weight), sort: keep if _n==1

to drop all but the lightest member of each group. Sometimes it may be necessary to transpose the data, converting variables to observations and vice versa. This may be done and undone using xpose.

If each observation represents a number of units (as after collapse), it may sometimes be required to replicate each observation by the number of units, num, that it represents. This may be done using

expand num

If there are two datasets, subj.dta, containing subject-specific variables, and occ.dta, containing occasion-specific variables for the same subjects, then if both files contain the same sorted subject identifier subj_id and subj.dta is currently loaded, the files may be merged as follows:

merge subj_id using occ

resulting in the variables from subj.dta being expanded as in the expand command above and the variables from occ.dta being added.

1.7 Estimation

All estimation commands in Stata, for example regress, logistic, poisson, and glm, follow the same syntax and share many of the same options. The estimation commands also produce the same kind of output and save the same kind of information. The stored information may be processed using the same set of post-estimation commands.

In the basic command structure, the response variable is specified by depvar and the explanatory variables by indepvars. If dummy variables and interactions are required for categorical explanatory variables, using the xi: ("interaction expansion") prefix enables special notation to be used for indepvars. For example,

xi: regress resp i.x

creates dummy variables for each value of x except the lowest value and includes these dummy variables as predictors in the model.

xi: regress resp i.x*y z

fits a regression model with the main effects of x, y, and z and their interaction x×y where x is treated as categorical and y and z as continuous (see help xi for further details).

The syntax for the [weight] option is

[weighttype=varname]

where weighttype depends on the reason for weighting the data. If the data are in the form of a table where each observation represents a group containing a total of freq observations, using [fweight=freq] is equivalent to running the same estimation command on the expanded dataset where each observation has been replicated freq times. If the observations have different variances, for example, because they represent averages of different numbers of observations, then aweights is used with weights proportional to the reciprocals of the variances. Finally, pweights is used for probability weighting where the weights are equal to the inverse probability that each observation was sampled. (Another type of weights, iweights, is available for some estimation commands, mainly for use by programmers.)
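The equivalence between frequency weights and the expanded dataset can be checked with a small sketch. The data are hypothetical, and summarize is used instead of an estimation command only to keep the example short:

```stata
* Hypothetical grouped data: each row stands for freq identical observations
clear
input x freq
1 3
2 1
5 2
end

summarize x [fweight=freq]
local wmean = r(mean)           // weighted mean from the frequency table

expand freq                     // replicate each row freq times (6 rows)
summarize x
display `wmean' " " r(mean)     // the two means coincide
```

The same comparison could be made for an estimation command such as regress by fitting the model once with [fweight=freq] and once on the expanded data.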

All the results of an estimation command are stored and can be processed using post-estimation commands. For example, predict may be used to compute predicted values or different types of residuals for the observations in the present dataset and the commands test, testparm, lrtest, and lincom for inferences based on previously estimated models. It is easy to find out what post-estimation commands are available for a given estimation command: simply click into "post-estimation" at the top right of the help file.

The saved results can also be accessed directly using the appropriate names. For example, the regression coefficients are stored in global macros called _b[varname]. To display the regression coefficient of x, simply type

display _b[x]

To access the entire parameter vector, use e(b). Many other results may be accessed using the e(name) syntax. See the "Saved Results" section of the entry for the estimation command in the Stata Reference Manuals to find out under what names particular results are stored. After estimation, the command

ereturn list

lists the names and contents of all estimation results accessible via e(name).

Note that "r-class" results produced by commands that are not estimation commands can be accessed using r(name). For example, after summarize, the mean can be accessed using r(mean). The command

return list

lists the names and contents of all "r-class" results currently available.
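Both kinds of stored results can be explored with the following sketch. It assumes the auto dataset that is shipped with Stata and loaded using sysuse; the choice of mpg and weight as response and explanatory variable is arbitrary:

```stata
* Uses the auto dataset shipped with Stata; any dataset would do
sysuse auto, clear
regress mpg weight
display _b[weight]          // coefficient of weight
matrix list e(b)            // the whole parameter vector
ereturn list                // everything stored in e()

summarize mpg
display r(mean)             // r-class result left behind by summarize
return list
```

The estimation results in e() remain available after summarize has been run, because r-class commands do not overwrite them.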

1.8 Graphics

The graphical user interface (GUI) makes it extremely easy to produce a very attractive graph with different line-styles, legends, etc. To demonstrate this, we first simulate some data as follows:

clear
set obs 100
set seed 13211
gen x = invnormal(uniform())
gen y = 2 + 3*x + invnormal(uniform())

To produce a scatterplot of y versus x via the GUI, select Twoway graph (scatterplot, line, etc.) from the Graphics menu and click into Scatter under Plot type to obtain the dialog box shown in Figure 1.4. Specify x and y in the boxes labeled X variable: and Y variable:. This can be done either by typing or by pressing the little down arrow to select among the variables in the dataset. To add a label to the x-axis, click into the tab labeled X-Axis and type "Simulated x" in the Title box. Similarly, type "Simulated y" in the Title box in the Y-Axis tab. Finally, click OK to produce the graph shown in Figure 1.5. To change for instance the symbol we would have to plot the graph again, this time selecting a different option in the box labeled Symbol under the heading Markers in the dialog box (it is not possible to edit a graph). The following command appears in the output:

Figure 1.4: Dialog box for twoway graph.

twoway (scatter y x), ytitle(Simulated y) xtitle(Simulated x)

The command twoway, short for graph twoway, can be used to plot scatterplots, lines or curves, and many other plots requiring an x and y-axis. Here the plottype is scatter which requires a y and x variable to be specified. Details such as axis labels are given after the comma. Help on scatterplots can be found (either in the manual or using help) under "graph twoway scatter". Help on options for graph twoway can be found under "twoway_options".

To produce several graphs, each displayed in a separate window, the graphs must be given different names. In the GUI this can be achieved by clicking into the Overall tab of the graph dialog box and typing a name into the box labeled Name of graph:. If a graph of the same name already exists, use a different name to open a new window or click into the Replace box to replace the current graph. In the graph command, names can be assigned using the option name(name of graph) together with the replace option if required. To view a particular graph that may be hidden behind other windows, use either the Window menu or the arrow in the toolbar.

Figure 1.5: Scatterplot of simulated data.

We can use a single graph twoway command to produce a scatterplot with a regression line superimposed:

twoway (scatter y x) (lfit y x), ///
    ytitle(Simulated y) xtitle(Simulated x) ///
    legend(order(1 "Observed" 2 "Fitted"))


giving the graph in Figure 1.6. Inside each pair of parentheses is a command specifying a plot to be added to the same graph. The options applying to the graph as a whole appear after these individual plots preceded by a comma as usual. Here the legend() option was used to specify labels for the legend; see the manual or help for "legend_option".

Each plot can have its own if exp or in range restrictions as well as various options. To demonstrate this, we first create a new variable, group, taking the values 1 and 2 and add 2 to the y-values of group 2:

gen group = cond(_n < 50,1,2)
replace y = y + 2 if group==2

Figure 1.6: Scatterplot and fitted regression line.

Now produce a scatterplot with different symbols for the two groups and separate regression lines using

twoway (scatter y x if group==1, msymbol(O)) ///
    (lfit y x if group==1, lpatt(solid)) ///
    (scatter y x if group==2, msymbol(Oh)) ///
    (lfit y x if group==2, lpatt(dash)), ///
    ytitle(Simulated y) xtitle(Simulated x) ///
    legend(order(1 2 "Group 1" 3 4 "Group 2"))

giving the graph shown in Figure 1.7. Here, the options msymbol(O) and msymbol(Oh) produce solid and hollow circles, respectively, whereas lpatt(solid) and lpatt(dash) produce solid and dashed lines, respectively. These options are inside the parentheses for the corresponding plots. The options referring to the graph as a whole, xtitle(), ytitle(), and legend(), appear after the individual plots have been specified. Just before the final comma, we could also specify if exp or in range restrictions for the graph as a whole.

Some people find it more convenient to separate plots using || instead of enclosing them in parentheses, for instance replacing the first two lines of the command above by

twoway scatter y x if group==1, ms(O) || ///
    lfit y x if group==1, clpatt(solid)

Figure 1.7: Scatterplot and fitted regression line.

The by() option can be used to produce separate plots (each with their own sets of axes) in the same graph. For instance

label define gr 1 "Group 1" 2 "Group 2"
label values group gr
twoway scatter y x, by(group)

produces the graph in Figure 1.8. Here the value labels of group are used to label the individual panels.

Other useful graphics commands include graph twoway function for plotting a function without having to define any new variables, graph matrix for scatterplot matrices, graph box for boxplots, graph bar for bar charts, histogram for histograms, kdensity for kernel density plots, and qnorm for Q-Q plots.

For graph box and graph bar, we may wish to plot different variables, referred to as yvars in Stata, for different subgroups or categories of individuals, specified using the over() option. For example,

replace x = x + 1
graph bar y x, over(group)


results in the bar chart in Figure 1.9. For more information on changing the labeling and presentation of the bars, see yvar_options and group_options in [G] graph bar.

The general appearance of graphs is defined in schemes. In this book we use scheme sj (Stata Journal) by issuing the command

set scheme sj

at the beginning of each Stata session. See [G] schemes intro or help schemes for a complete list and description of schemes available.

Graphs can be saved in Stata's .gph format and read back in using

graph save mygraph
graph use mygraph

Graphs can also be exported as encapsulated postscript or PNG files using, respectively,

graph export mygraph.eps
graph export mygraph.png

See help graph export for a list of all available storage types.

We find the GUI interface particularly useful for learning about these and other graphics commands and their options. Mitchell (2004) is a useful reference book which contains a large collection of graphs and the commands used to create them.

Figure 1.9: Bar chart.

1.9 Stata as a calculator

Stata can be used as a simple calculator using the command display followed by an expression, e.g.,

display sqrt(5*(11-3^2))

3.1622777

There are also a number of statistical commands that can be used without reference to any variables. These commands end in i, where i stands for immediate command. For example, we can calculate the sample size required for an independent samples t-test to achieve 80% power to detect a difference at the 1% level (2-sided) if the population means are 1 and 2 and the within population standard deviation is 1 using sampsi as follows:

sampsi 1 2, sd(1) power(.8) alpha(0.01)

Page 40: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

A Hnef Intuhacttm lo Slata . 31 -

(see Display 1.1). Similarly, ttesti can be used to carry out a t-test if the means, standard deviations, and sample sizes are given.

Estimated sample size for two-sample comparison of means

Test Ho: m1 = m2, where m1 is the mean in population 1
                    and m2 is the mean in population 2

Assumptions:

         alpha =   0.0100  (two-sided)
         power =   0.8000
            m1 =        1
            m2 =        2
           sd1 =        1
           sd2 =        1
         n2/n1 =     1.00

Estimated required sample sizes:

            n1 =       24
            n2 =       24

Display 1.1

As briefly shown in Section 1.5.4, results can be saved in local macros

using the syntax

local a = exp

The result may then be used again by enclosing the local macro name in single quotes ` ' (using two different keys on the keyboard). For example,

local a = 5
display sqrt(`a')
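As a further illustration of display combined with a local macro, the following sketch recomputes the sample size reported by sampsi above. It assumes the usual normal-approximation formula for two equal-sized groups; the macro name z is arbitrary and not part of the original text.

```stata
* Sketch only: assumes n per group = 2*(sd/difference)^2 * (z_alpha/2 + z_power)^2
local z = invnormal(1 - 0.01/2) + invnormal(0.8)
* sd = 1, difference in means = 2 - 1; round up to a whole subject
display ceil(2 * (1/(2 - 1))^2 * `z'^2)
```

The value displayed should agree with the 24 subjects per group reported by sampsi.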

Matrices can also be defined and matrix algebra carried out interactively. The following matrix commands define a matrix a, display it, and give its trace and its eigenvalues:

matrix a = (1,2\ 2,4)
matrix list a
display trace(a)
matrix symeigen x v = a
matrix list v

For more powerful matrix calculations, use the Mata programming language as described in the next section.
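A small check on these matrix commands, given as a sketch rather than as part of the original text: for a symmetric matrix the eigenvalues sum to the trace, and this particular matrix is singular because its second row is twice its first.

```stata
matrix a = (1,2 \ 2,4)
display det(a)                     // determinant; zero up to rounding error
matrix symeigen x v = a            // x: eigenvectors, v: row vector of eigenvalues
display el(v,1,1) + el(v,1,2)      // sum of eigenvalues; equals trace(a) up to rounding
```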

1.10 Matrix calculations using Mata

Mata is a fast matrix language resembling C. Here we show how Mata can be used for performing calculations interactively; Mata can also be used to write programs that are automatically compiled by Stata and hence run faster than programs written in Stata's usual programming language. See [M] Mata Reference Manual and help mata for full details.

To enter the Mata environment, type mata in the Command window and press Return. To exit Mata, type end followed by Return. You can tell if you are within Mata because the prompt changes from a period "." to a colon ":". An example of a Mata session, as it would appear in the output, is given in Display 1.2. We see that typing an expression displays the result. Variables can be defined using the = operator and subsequently used in expressions. Here we define a 2 x 2 matrix A using commas to separate columns and \ to separate rows as in Stata's matrix commands.

The variables defined in this session will not be cleared when we exit Mata; they will still be there and can be listed using mata describe if we re-enter Mata

mata
mata describe

      # bytes   type                        name and extent
           32   real matrix                 A[2,2]
            8   real scalar                 x

We see that associated with each object is a type defined by two aspects Stata calls eltype, here real, and orgtype, here matrix and scalar. To clear Mata without disturbing Stata, use the command

mata clear

. mata
mata (type end to exit)
: 2 + 3
  5

: x = 2 + 3

: x
  5

: A = (1, 2 \ 3, 4)

: A
       1   2
    +---------+
  1 |  1   2  |
  2 |  3   4  |
    +---------+

Display 1.2

In Mata, we can evaluate complex matrix expressions using very intuitive notation. For instance, the least squares estimator $(X'X)^{-1}X'y$ is evaluated using

invsym(X'X)*X'y

where invsym() stands for "inverse of a symmetric matrix". Here * performs matrix multiplication. To perform element-by-element multiplication, use the colon operator:

. mata
mata (type end to exit)
: A
: end
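The two operations just described can be tried on small made-up matrices. The sketch below is not from the original text: X, y, and b are hypothetical names, and the data are chosen so that y equals the second column of X, which makes the least squares solution easy to verify by hand.

```stata
mata:
X = (1,1 \ 1,2 \ 1,3)      // hypothetical design matrix: constant and one covariate
y = (1 \ 2 \ 3)            // hypothetical response, equal to the covariate
b = invsym(X'*X)*X'*y      // least squares estimate; intercept near 0, slope near 1
b
A = (1,2 \ 3,4)
A*A                        // matrix product
A:*A                       // element-by-element product
end
```

Writing the multiplications out with explicit * is a cautious choice here; it gives the same result as the more compact notation used in the text.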

1.11 Brief introduction to programming

So far we have described commands as if they would be run interactively. However, in practice, it is always useful to be able to repeat the entire analysis using a single command. This is important, for example, when a data entry error is detected after most of the analysis has already been carried out. It is also important to keep a record of all data manipulation so that it can be checked and corrected later. In Stata, a set of commands stored as a do-file, called for example, analysis.do, can be executed using the command

do analysis


We strongly recommend that readers create do-files for any work in Stata, e.g., for the exercises of this book.

One way of generating a do-file is to carry out the analysis interactively and save the commands, for example, by right-mouse clicking into the Review window and clicking into Save Review Contents... Stata's Do-file Editor, which is opened by clicking into its toolbar icon, can also be used to create or edit a do-file. One way of trying out commands interactively and building up a do-file is to run commands in the Commands window and copy them into the Do-file Editor after checking that they work. Another possibility is to type commands into the Do-file Editor and try them out individually by highlighting the commands and clicking into the corresponding toolbar icon or selecting Tools → Do Selection. Alternatively, any text editor may be used to create a do-file. The following is a useful template for a do-file:

/* comment describing what the file does */
version 9.2
capture log close
log using filename, replace
set more off

/* commands for the analysis go here */

log close
exit

We will explain each line in turn.

1. The "brackets" /* and */ cause Stata to ignore everything between them. Another way of commenting out lines of text is to start the lines either with an asterisk * or with a double forward slash //.

2. The command version 9.2 causes Stata to interpret all commands as if we were running Stata version 9.2 even if, in the future, we have actually installed a later version in which some of these commands do not work any more.

3. The capture prefix causes the do-file to continue running even if the command results in an error. The capture log close command therefore closes the current log file if one is open; if none is open, the resulting error is captured and the do-file carries on. (Another useful prefix is quietly which suppresses all output, except error messages.)

4. The command log using filename, replace opens a log file, replacing any file of the same name if it already exists.


5. The command set more off causes all the output to scroll past automatically instead of waiting for the user to scroll through it manually. This is useful if the user intends to look at the log file for the output.

6. After the analysis is complete, the log file is closed using log close.

7. The last statement, exit, is not necessary at the end of a do-file but may be used to make Stata stop running the do-file wherever it is placed.
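The behavior of the capture prefix described in item 3 can be seen through the return code _rc, which Stata sets after a captured command. The lines below are a sketch; novar is a hypothetical name for a variable that is assumed not to exist.

```stata
* capture stores the return code in _rc instead of stopping the do-file
capture log close
display _rc                        // 0 if a log was open and has been closed, nonzero otherwise

* capture noisily shows the error message but still lets the do-file continue
capture noisily summarize novar    // novar: hypothetical, nonexistent variable
display _rc                        // nonzero because the command failed
```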

Variables, global macros, local macros, and matrices can be used for storing and referring to data and these are used extensively in programs. For example, we may wish to subtract the mean of x from x. Interactively, we would use

summarize x

to find out what the mean is and then subtract that value from x. However, we should not type the value of the mean into a do-file because the result would no longer be valid if the data change. Instead, we can access the mean computed by summarize using r(mean):

quietly summarize x, meanonly
gen xnew = x-r(mean)

(If all that is required from summarize is the mean or the sum, it is more efficient to use the meanonly option.) Most Stata commands are r-class, meaning that they store results that may be accessed using r() with the appropriate name inside the brackets. Estimation commands store the results in e(). To find out under what names results are stored, see the "Stored Results" section for the command of interest in the Stata Reference Manuals. Alternatively, execute the command and then issue the command return list for a list of all results stored in r() or ereturn list for a list of all results stored in e().
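A runnable sketch of this pattern follows. It uses the auto example dataset shipped with Stata instead of the variable x of the text, and price_c is a hypothetical name for the centered variable; both are assumptions made only so that the lines can be run as they stand.

```stata
sysuse auto, clear                   // example dataset, an assumption for illustration
quietly summarize price, meanonly
generate price_c = price - r(mean)   // use the stored mean right away
return list                          // list what summarize left behind in r()
summarize price_c                    // the mean should now be close to zero
```

Using r(mean) immediately after summarize matters because the next r-class command overwrites the contents of r().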

If a local macro is defined without using the = sign, anything can appear on the right-hand side and typing the local macro name in single quotes has the same effect as typing whatever appeared on the right-hand side in the definition of the macro. For example, if we have a variable y, we can define a macro containing the text y and use the macro wherever the variable name is expected.
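The difference between defining a macro with and without the = sign can be sketched as follows. The auto example dataset and its variable price stand in for the variable y of the text, and the macro names a and b are arbitrary; none of this is from the original.

```stata
sysuse auto, clear              // example dataset, used only for illustration
local a price                   // no = sign: the macro holds the text "price"
display "`a'[1] = " `a'[1]      // same as typing: display "price[1] = " price[1]
local b = 2 + 3                 // with = sign: the expression is evaluated first
display "`b'"                   // shows the evaluated result, not the text "2 + 3"
```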

Local macros are only "visible" within the do-file or program in which they are defined. Global macros may be defined using

global a = 1

and accessed by prefixing them with a dollar sign, for example,

gen b = $a

Sometimes it is useful to have a general set of commands (or a program) that may be applied in different situations. It is then essential that variable names and parameters specific to the application can be passed to the program. If the commands are stored in a do-file, the "arguments" with which the do-file will be used are referred to as `1', `2' etc. inside the do-file. For example, a do-file filename.do containing the command

list `1' `2'

may be run using

do filename x1 x2

to cause x1 and x2 to be listed. Alternatively, we can define a program which can be called without using the do command in much the same way as Stata's own commands. This is done by enclosing the set of commands by

program progname
end

After running the program definition, we can run the program by typing the program name and arguments.

Most programs require things to be done repeatedly by looping through some list of objects. This can be achieved using foreach and forvalues. For example, we define a program called mylist that lists the first three observations of each variable in a variable list:

program mylist
    version 9.2
    syntax varlist
    foreach var in `varlist' { /* outer loop: variables */
        display "`var'"
        forvalues i=1/3 { /* inner loop: observations */
            display `var'[`i']
        }
        display " "
    }
end


We can run the program by typing mylist followed by a list of variable names. Here the syntax command defines the syntax to be

mylist varlist

(no options allowed), issues an error message if varlist is not valid, for example if any of the variables do not exist, and places the variable names into the local macro varlist (see help syntax and [P] syntax). The outer foreach loop repeats everything within the outer braces for each variable in varlist. Within this loop, the "current" variable is placed in the local macro var. For each variable, the inner forvalues loop repeats the display command for i equal to 1, 2, and 3.

A program may be defined by typing it into the Commands window. However, this is almost never done in practice. A more useful method is to define the program within a do-file where it can easily be edited. Note that once the program has been loaded into memory (by running the program command), it has to be cleared from memory using program drop before it can be redefined. It is therefore useful to have the command

capture program drop mylist

in the do-file before the program command, where capture ensures that the do-file continues running even if mylist does not yet exist.
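Putting these pieces together, a do-file that defines and then calls mylist might look like the sketch below. The auto example dataset and the variables price and mpg are assumptions chosen so that the file runs on any Stata installation; they are not part of the original example.

```stata
capture program drop mylist      // clear any earlier definition; no error if none exists
program mylist
    version 9.2
    syntax varlist
    foreach var in `varlist' {       // outer loop: variables
        display "`var'"
        forvalues i = 1/3 {          // inner loop: observations 1 to 3
            display `var'[`i']
        }
        display " "
    }
end

sysuse auto, clear               // example dataset, an assumption for illustration
mylist price mpg                 // lists the first three values of each variable
```

Because syntax varlist checks its arguments, calling mylist with a name that is not a variable in the data stops with an error before anything is displayed.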

A program may also be saved in a separate file (containing only the program definition) of the same name as the program itself and having the extension .ado. If the ado-file (automatic do-file) is in a directory in which Stata looks for ado-files, for example the current directory, it can be executed simply by typing the name of the file. There is no need to load the program first (by running the program definition). To find out where Stata looks for ado-files, type

adopath

This lists various directories including \ado\personal/, the directory where personal ado-files may be stored. Many of Stata's own commands are actually ado-files stored in the ado subdirectory of the directory where the Stata executable (e.g., wstata.exe) is located.

The [P] Programming Reference Manual gives detailed information on the programming commands mentioned here and many more. Type help dialog programming for information on programming your own dialogs. The Stata Plugin Interface (SPI) which allows compiled C-programs to be called from Stata programs is described in detail at http://www.stata.com/plugins. While this is useful if the C-program already exists, it will often be easier to write functions in Mata than in C.

Chapter 13 of this book gives some examples of maximizing your own likelihood using the ml command, and this is discussed in detail in Gould et al. (2006).

1.12 Keeping Stata up to date

StataCorp continually updates the current version of Stata. If the computer is connected to the Internet, Stata can be updated by issuing the command

update all

Ado-files are then downloaded and stored in the correct directory. If the executable has changed since the last update, a new executable (e.g., wstata.bin) is also downloaded. This file should be used to overwrite the old executable (e.g., wstata.exe) after saving the latter under a new name (e.g., wstata.old). A quick and easy way of achieving this is to issue the command update swap within Stata. The command help whatsnew lists all the changes since the release of the present version of Stata.

In addition to Stata's official updates to the package, users are continuously creating and updating their own commands and making them available to the Stata community. Articles on user-written programs are published in a peer-reviewed journal called The Stata Journal (SJ) which replaced the Stata Technical Bulletin (STB) at the end of 2001 and is indexed in the Science Citation Index. These and other user-written programs can be downloaded by clicking into Help → SJ & User-written Programs, and selecting one of a number of sites including sites for the SJ and STB. A large repository for user-written Stata programs is the Statistical Software Components (SSC) archive at http://ideas.RePEc.org/s/boc/bocode.html maintained by Kit Baum (the archive is part of IDEAS which uses the RePEc database). These programs can be downloaded using the ssc command. To find out about commands for a particular problem (user-written or part of Stata), use the findit command. For example, running

findit meta

brings up the Stata Viewer with a long list of entries including one on STB-42:

STB-42  sbe16.1 . . . . .  New syntax and output for the meta-analysis command
        (help meta if installed) . . . . . . . . . . .  S. Sharp and J. Sterne
        3/98    pp.6-8; STB Reprints Vol 7, pp.106--108

which reveals that STB-42 has a directory in it called sbe16.1 containing files for "New syntax and output for the meta-analysis command" and that help on the new command may be found using help meta, but only after the program has been installed. The authors are S. Sharp and J. Sterne. The command can be installed by clicking into the corresponding hyperlink in the Stata Viewer (or through Help → SJ & User-written Programs, clicking on STB, then stb42, then sbe16_1) and selecting (click here to install). The program can also be installed using the commands

net stb 42 sbe16_1

(see help net). Note that findit first lists programs that have been published in the SJ and STB, followed by programs from other sites such as the SSC. This order often does not reflect the (reverse) chronological order of versions of a given program since the SSC usually has the most up-to-date version (look for RePEc in the URL). The most reliable way of installing a program from the SSC is using the command (here illustrated for the program gllamm)

ssc install gllamm

See help ssc.
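A cautious workflow around ssc install can be sketched as follows. These lines need an Internet connection, and the order shown (describe, install, then locate the file) is a suggestion rather than something prescribed by the original text.

```stata
ssc describe gllamm     // show the package description before installing
ssc install gllamm      // install; add the replace option to overwrite an older copy
which gllamm            // report where the installed ado-file is located
```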

1.13 Exercises

1.1 Some data manipulation

1. Use a text editor (e.g., Notepad, PFE, or the Stata Do-file Editor) to generate the dataset test.dat given below, where the columns are separated by tabs (make sure to save it as a text only or ASCII file).

v1 v2 v3
1 3 5
2 16 3
5 12 2

2. Read the data into Stata using insheet (see help insheet).
3. Click into the Data Editor and type in the variable sex with values 1, 2, and 1.
4. Define value labels for sex (1=male, 2=female).
5. Use generate to generate id, a subject index (from 1 to 3).
6. Use rename to rename the variables v1 to v3 to time1 to time3. Also try doing this in a loop using forvalues.
7. Use reshape to convert the dataset to long shape.
8. Generate a variable d that is equal to the squared difference between the variable time at each occasion and the average of time for each subject.
9. Drop the observation corresponding to the third occasion for id=2.

1.2 Wage increases

Here we use panel data for 545 American young males taken from the National Longitudinal Survey (Youth Sample) for the period 1980-1987. The data have been analyzed by Vella and Verbeek (1998) and Wooldridge (2002). The subset of variables in wagepan.dta considered here are:

■ year: calendar year 1980 to 1987
■ lwage: natural log of hourly wage in US$
■ black: dummy variable for being black
■ hisp: dummy variable for being Hispanic

1. Create a new variable equal to the exponential of lwage.
2. Collapse the data to obtain the mean hourly wages by year and ethnic/racial group (black, Hispanic, other).
3. Produce a line graph (using twoway line) showing the mean wage over time, separately for the ethnic/racial groups.
4. Improve the graph by defining labels, line patterns, legends, etc., using the GUI if preferred.

1.3 Finding information

Without consulting the manuals, use Stata's help facilities and/or GUI to find out the following:

1. The name and syntax of the Stata function that calculates the inverse cumulative F distribution
2. The name of the Stata command to produce a LOWESS curve
3. The option for the regress command that will cause the intercept to be omitted
4. How to get adjusted means for a given regression model (Hint: this is a post-estimation problem)
5. How to plot a histogram with a dashed normal density curve superimposed


Chapter 2

Data Description and Simple Inference: Female Psychiatric Patients

2.1 Description of data

The data to be used in this chapter consist of observations on 8 variables for 118 female psychiatric patients and are available in Hand et al. (1994). The variables are as follows:

■ age: age in years
■ iq: intelligence score
■ anxiety: anxiety (1=none, 2=mild, 3=moderate, 4=severe)
■ depress: depression (1=none, 2=mild, 3=moderate, 4=severe)
■ sleep: can you sleep normally? (1=yes, 2=no)
■ sex: have you lost interest in sex? (1=no, 2=yes)
■ life: have you thought recently about ending your life? (1=no, 2=yes)
■ weight: increase in weight over last six months (in lb)

The data are given in Table 2.1; missing values are coded as -99. There are a variety of questions that might be addressed by these data; for example, do women who have recently contemplated suicide differ in any respects from those who have not? Also of interest are the correlations between anxiety and depression and between weight change, age, and IQ. It should be noted, however, that any associations found from cross-sectional observational data like these are at best suggestive of causal relationships.

Table 2.1 Data in fem.dat

id   age  IQ    anx  depress  sleep  sex  life  weight
 1   39   04     2     2        2     2    2     4.0
 2   41   89     2     2        2     2    2     2.2
 3   42   8"     3     3        3     2    2     4.0
 4   30   99     2     2        2     2    2    -2.6
 5   35   94     2     1        1     2    1    -0.3
 6   44   W    -99     1        2     1    1     0.9
 7   3    1      2     2      -99     2    2    -1.5
 8   3    87     3     2        2     2    1     3 ;
 9   .x  -99     3     2        2     2    2    -1.2
10   33   92     2     2        2     2    2     0.8
11   38   92     2     1        1     1    1    -1.9
12   31   94     2     2        2   -99    1     5.5
13   40   91     3     2        2     2    1     2.7
14   44   86     2     2        2     2    2     1.4
15   43   !MI    3     2        2     2    2     3.2
16   32  -99     1     1        1     2    1    -1.5
17   32   91     1     2        2   -99    1    -1.9
18   43   X'L    4     3        2     2    2     8.3
19   46   r(G    3     2        2     2    2     3.fi
20   30   88     2     2        2     2    1     1.4
21   24   97     3     3      -99     2    2    -9.0
22   37   96     a     2        2     2    1   -99.0
23   35   95     2     1        2     2    1    -1n
24   45   87     2     2        2     2    2     6.5
25   35   103    2     2        2     2    1    -2.1
26   51  -99     2     2        2     2    1    -0.4
27   32   91     2     2        2     2    1    -1.9
28   44   87     2     2        2     2    2     3.7
29   40   PL     3     3        2     2    2     4.5
30   42   89     3     3        2     2    2     4.2
31   36   '32    8     -        2     2    2   -99.0
32   42   84     3     3        2     2    2     1.7
33   46   44     2   -99        2     2    2     4.8
34   11   92     2     1        2     2    1     1.7
35   40   96   -99     2        2     2    2    -3.0
36   39   06     2     2        2     1    1     0.8
37   40   86     2     3        2     2    2     1.5
38   42   92     :I    2        2     2    1     1.3
39   :15  ~oa    2     2        a     2    2     3.0
40   31   82     2     2        2     2    1     2.0
41   33   92     3     3        2     2    2     1.5
42   43   8[1  -99   -99        2     2    2     3.1
43   37   92     2     1        1     1    1   -99.0

Table 2.1 Data in fem.dat (continued)

32 88 4 2 2 2 1 -99.0
34 na 2 2 2 2 -9 0.6

Table 2.1 Data in fem.dat (continued)

92 92 98 2 2 2 2 2 -0.3

2.2 Group comparison and correlations

The data in Table 2.1 contain a number of interval scale or continuous variables (weight change, age, and IQ), ordinal variables (anxiety and depression), and dichotomous variables (sex and sleep) that we wish to compare between two groups of women: those who have thought about ending their lives and those who have not.

For interval scale variables, the most common statistical test is the t-test which assumes that the observations in the two groups are independent and are sampled from two populations each having a normal distribution and equal variances. A nonparametric alternative (which does not rely on the latter two assumptions) is the Mann-Whitney U-test.

For ordinal variables, either the Mann-Whitney U-test or a chi-squared test may be appropriate depending on the number of levels of the ordinal variable. The latter test can also be used to compare dichotomous variables between the groups.

Continuous variables can be correlated using the Pearson correlation. If we are interested in the question whether the correlations differ significantly from zero, then a hypothesis test is available that assumes bivariate normality. A significance test not making this distributional assumption is also available; it is based on the correlation of the ranked variables, the Spearman rank correlation. Finally, if variables have only few categories, Kendall's tau-b provides a useful measure of correlation (see, e.g., Sprent and Smeeton, 2001). More details of these tests and correlation coefficients can be found in Altman (1991) and Agresti (2002).
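The nonparametric procedures mentioned in this section each have a Stata command. The sketch below runs them on the auto example dataset so that it is self-contained; the choice of dataset and variables is an assumption made purely for illustration, and with the data of this chapter one would substitute variables such as weight, life, anxiety, and depress.

```stata
sysuse auto, clear          // example data, used so the sketch runs without fem.dat
ranksum mpg, by(foreign)    // Mann-Whitney U-test (Wilcoxon rank-sum) for two groups
spearman mpg weight         // Spearman rank correlation
ktau rep78 foreign          // Kendall's tau-a and tau-b
```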

2.3 Analysis using Stata

Assuming the data have been saved from a spreadsheet or statistical package (for example SAS or SPSS) as a tab-delimited ASCII file, fem.dat, they can be read using the instruction

insheet using fem.dat, clear

There are missing values which have been coded -99. We replace these with Stata's missing value code "." using

mvdecode _all, mv(-99)

The variable sleep has been entered incorrectly as "3" for subject 3. Such data entry errors can be detected using the command

codebook

which displays information on all variables; the output for sleep is shown below:

sleep                                                          SLEEP

          type:  numeric (byte)

         range:  [1,3]                      units:  1
 unique values:  3                      missing .:  5/118

    tabulation:  Freq.  Value
                    14  1
                    98  2
                     1  3
                     5  .

Alternatively, we can detect errors using the assert command. For sleep, we would type

assert sleep==1|sleep==2|sleep==.


1 contradiction in 118 observations
assertion is false
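In a do-file, a failed assert stops execution, so it can be convenient to capture it and list the offending observations. The sketch below uses a tiny made-up dataset in place of fem.dat; the four rows are invented for illustration and only mimic the kind of error described above.

```stata
clear
input id sleep
1 2
2 2
3 3
4 .
end

* capture the assertion so the do-file continues, then inspect the return code
capture assert sleep==1 | sleep==2 | sleep==.
if _rc != 0 {
    list id sleep if !(sleep==1 | sleep==2 | sleep==.)   // show the invalid codes
}
```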

Since we do not know what the correct code for sleep should have been, we can replace the incorrect value of 3 by "missing"

replace sleep=. if sleep==3

In order to have consistent coding for "yes" and "no", we recode the variable sleep

recode sleep 1=2 2=1

and to avoid confusion in the future, we label the values as follows:

label define yn 1 no 2 yes
label values sex yn
label values life yn
label values sleep yn

The last three commands could also have been carried out in a foreach loop:

foreach x in sex life sleep {
    label values `x' yn
}

First, we could compare the suicidal women with the non-suicidal by simply tabulating suitable summary statistics in each group. To obtain means and standard deviations for the continuous variables, we can use the tabstat command:

tabstat weight age iq, by(life) statistics(mean sd)

Summary statistics: mean, sd
  by categories of: life

A more formal approach to comparing the two groups on say weight gain over the last six months might involve an independent samples t-test. First, however, we need to check whether the assumptions needed for the t-test appear to be satisfied for weight gain. One way this can be done is by plotting the variable weight as a boxplot for each group:

graph box weight, by(life) box(1, bfcolor(none)) ///
    box(2, bfcolor(none)) yline(0) medtype(line) ///
    ytitle(weight change in last six months)

Figure 2.1: Boxplot of weight by group.

giving the graph shown in Figure 2.1. The yline(0) option has placed a horizontal line at 0. (Note that in the instructions above, the forward slashes /// were used to make Stata ignore the line breaks in the middle of the graph box command in a do-file, but these should not be used in the Stata Command window, where commands can wrap over multiple lines.) The groups do not seem to differ much in their median weight change and the assumptions for the t-test seem reasonable because the distributions are symmetric with similar spread (box heights represent interquartile ranges).

We can also check the assumption of normality more formally by plotting a normal quantile plot of suitably defined residuals. Here the difference between the observed weight changes and the group-specific mean weight changes can be used. If the normality assumption is satisfied, the quantiles of the residuals should be linearly related to the quantiles of the normal distribution with the same mean and standard deviation. The residuals can be computed and plotted using

egen res=mean(weight), by(life)
replace res=weight-res
qnorm res, title("Normal Q-Q plot") saving(qnorm,replace) ///
    ytitle(residuals for weight change)

The points in the Q-Q plot in Figure 2.2 appear to be sufficiently close to the straight line to justify the normality assumption.

Figure 2.2: Normal Q-Q plot of residuals of weight change.

We could also test whether the variances differ significantly using

robvar weight, by(life)

giving the output shown in Display 2.1. Here the first test is Levene's test, and this test indicates that there is no evidence that the variances differ (F(2,104) = 1.37, p = 0.26). The W50 test statistic replaces the means in Levene's test by medians and the W10 statistic replaces the means by 10% trimmed means. All these tests are more robust to non-normality than the conventional homogeneity of variance test produced by the sdtest command.

            |     Summary of WEIGHT
       LIFE |        Mean   Std. Dev.       Freq.
------------+------------------------------------
         no |   1.4088889   2.6092338          45
        yes |   1.7311475   2.8256292          61
------------+------------------------------------
      Total |   1.5943396    2.727805         106

W0  =  1.371606    df(2, 104)     Pr > F = .25824967
W50 =  1.2167221   df(2, 104)     Pr > F = .30038
W10 =  1.3083681   df(2, 104)     Pr > F = .27487211

Display 2.1

Having found no strong evidence that the assumptions of the t-test are not valid for weight gain, we can proceed to apply a t-test:

ttest weight, by(life)

Two-sample t test with equal variances

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]

    diff = mean(no) - mean(yes)                                   t =  -0.5993
Ho: diff = 0                                     degrees of freedom =      104

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.2751         Pr(|T| > |t|) = 0.5502          Pr(T > t) = 0.7249

Display 2.2

Display 2.2 shows that the difference in means is estimated as -0.32 with a 95% confidence interval from -1.39 to 0.74. The two-tailed p-value is 0.55, so there is no evidence that the populations differ in their mean weight change. (The unequal option could be used to relax the assumption of equal population variances.)
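Because ttesti works from summary statistics alone, the same comparison can be sketched without the raw data. The group sizes, means, and standard deviations below are those read from Display 2.1; the first standard deviation (about 2.61) is partly illegible in the source and is taken here as an assumption.

```stata
* Equal-variance t-test from group summaries: n, mean, sd for "no", then for "yes"
ttesti 45 1.4088889 2.6092338 61 1.7311475 2.8256292
display r(t) "  " r(p)        // t statistic and two-sided p-value stored by ttesti

* Same comparison without assuming equal population variances
ttesti 45 1.4088889 2.6092338 61 1.7311475 2.8256292, unequal
```

If the assumed summaries are right, the first command should give a t statistic close to the -0.5993 of Display 2.2.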

Now suppose we wish to compare the prevalence of depression between suicidal and non-suicidal women. The two categorical variables can be cross-tabulated and the appropriate chi-squared statistic calculated using a single command:

tabulate life depress, row chi2

The output is shown in Display 2.3. Here the row option was used to display row-percentages, making it easier to compare the groups of women. For example, 50.98% of non-suicidal women are not depressed at all, compared with 0% of suicidal women. The value of the chi-squared statistic implies that there is a highly significant association between depression and suicidal thoughts (χ² = 43.8, d.f.=2, p < 0.001). Note that this test does not take account of the ordinal nature of depression and is therefore likely to be less sensitive than, for example, ordinal regression (see Chapter 6). Since some cells in the cross classification have only small counts, we might want to use Fisher's exact test (see Everitt, 1992) rather than the chi-squared test. (We could first use the expected option in the tabulate command to check if the expected counts are small.) The necessary command (without reproducing the table) is as follows:

tabulate life depress, exact nofreq

Fisher's exact = 0.000

Again we find strong evidence for a relationship between depression and suicidal thoughts (Fisher's exact test, p < 0.001).

A useful display for two-way tables is a bar chart which can be produced as follows:

graph bar (count) id, ///
    over(depress, relabel(1 "none" 2 "mild" 3 "moderate")) ///
    over(life, relabel(1 "non-suicidal" 2 "suicidal")) ///
    ytitle(Percentages by group (suicidal versus not)) ///
    asyvars percent showyvars legend(off)

Here we used (count) id to plot the number of non-missing values of id and two over() options to specify the grouping variables depress and life. Within these over() options, we defined the labels to be printed for the categories of depress and life. To display the bars for the first grouping variable in different colors, we used the asyvars option. The asyvars option also allows us to use the percent option to convert the counts for the different depression categories to percentages within the two groups (suicidal versus not). Finally, we used showyvars and legend(off) to place the labels for the depression categories underneath the corresponding bars instead of within a legend. The graph is shown in Figure 2.3.

Figure 2.3: Bar chart of percentages of women in different depression categories by group (suicidal versus nonsuicidal).


We now test the null hypothesis that the proportion of women who have lost interest in sex does not differ between the populations of suicidal and non-suicidal women. We can obtain the relevant table and both the chi-squared test and Fisher's exact tests using

tabulate life sex, row chi2 exact

with results shown in Display 2.4. Therefore, those who have thought about ending their lives are more likely to have lost interest in sex than those who have not (92% compared with 76%) and the association is significant (Fisher's exact test, p = 0.032).

Key: frequency
     row percentage

           |          SEX
      LIFE |        no        yes |     Total
-----------+----------------------+----------
        no |        12         38 |        50
           |     24.00      76.00 |    100.00
-----------+----------------------+----------
       yes |         5         58 |        63
           |      7.94      92.06 |    100.00
-----------+----------------------+----------
     Total |        17         96 |       113
           |     15.04      84.96 |    100.00

          Pearson chi2(1) =   5.6279   Pr = 0.018
           Fisher's exact =                 0.032
   1-sided Fisher's exact =                 0.017

Display 2.4

The correlations between the three variables weight, iq, and age can be found using the correlate command

corr weight iq age

(see Display 2.5). This correlation matrix has been evaluated for those 100 women who had complete data on all three variables. An alternative approach is to use the pwcorr command to include, for each correlation, all observations that have complete data for the corresponding pair of variables, resulting in different sample sizes for different correlations. These pairwise correlations can be obtained together with the sample sizes and p-values using

pwcorr weight iq age, obs sig

(see Display 2.6).

(obs=100)

             |   weight       iq      age

Display 2.5
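The difference between correlate and pwcorr is easiest to see on a dataset with missing values. The following is a minimal sketch, not taken from the original text: it assumes Stata's shipped example dataset auto.dta as a stand-in for the psychiatric data, with price, mpg, and rep78 as illustrative variables. Because rep78 is missing for some cars, correlate drops those cars from every correlation, whereas pwcorr drops them only from the pairs that involve rep78.

```stata
* Stand-in data (assumption): auto.dta ships with Stata; rep78 has missing values
sysuse auto, clear

* Listwise deletion: every correlation uses only cars complete on all three variables
correlate price mpg rep78

* Pairwise deletion: each correlation uses all cars complete on that pair;
* obs prints the pair-specific sample size and sig the p-value
pwcorr price mpg rep78, obs sig
```

Comparing the sample sizes printed by the two commands shows which correlations gained observations under pairwise deletion.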

             |   weight       iq      age
-------------+---------------------------
      weight |   1.0000
             |
             |      107
             |
          iq |  -0.2920   1.0000
             |   0.0092
             |      100      110
             |
         age |   0.4166  -0.4946   1.0000
             |   0.0000   0.0000
             |      107      110      118

Display 2.6

The corresponding scatterplot matrix is obtained using graph matrix as follows:

graph matrix weight iq age, half jitter(1) msymbol(Oh) ///
    msize(small) diagonal("Weight change" "IQ" "Age")

where jitter(1) randomly moves the points by a very small amount to stop them overlapping completely due to the discrete nature of age and IQ. The resulting graph is shown in Figure 2.4. We see that older and less intelligent women tend to put on more weight than younger and more intelligent ones. However, older women in this sample also tended to be less intelligent so that age and intelligence are confounded.

Figure 2.4: Scatterplot matrix for weight, IQ, and age.

It is of some interest to assess whether age and weight change have the same relationship in suicidal as in non-suicidal women. We shall do this informally by constructing a single scatterplot of weight change against age in which the women in the two groups are represented by different symbols. This is easily done by simply specifying two overlaid scatterplots:

twoway (scatter weight age if life==1, ///
        msymbol(O) mcolor(black) jitter(2)) ///
    (scatter weight age if life==2, ///
        msymbol(Oh) mcolor(black) ///
        jitter(2)), legend(order(1 "no" 2 "yes"))

The resulting graph in Figure 2.5 shows that within both groups, higher age is associated with larger weight increases, and the groups do not form distinct "clusters".

Finally, an appropriate correlation between the ordinal variables depression and anxiety is Kendall's tau-b which can be obtained using

ktau depress anxiety

  Number of obs =     107
Kendall's tau-a =       0.2827
Kendall's tau-b =       0.4951
Kendall's score =    1603
    SE of score =     288.275   (corrected for ties)

Test of Ho: depress and anxiety are independent
     Prob > |z| =       0.0000  (continuity corrected)

giving a value of 0.50 with an approximate p-value of p < 0.001. Depression and anxiety are clearly related in these psychiatrically ill women.
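As a rough check on this output, Kendall's tau-a is the score (concordant minus discordant pairs) divided by the total number of pairs, n(n - 1)/2. The sketch below is an added illustration that uses the 107 observations and a score of 1603, the value consistent with the reported tau-a of 0.2827.

```stata
* Number of distinct pairs among n = 107 observations
display 107*106/2

* tau-a = score / number of pairs; should agree with the reported 0.2827
display 1603/(107*106/2)
```

Kendall's tau-b rescales the same score to allow for ties, which is why it is larger than tau-a for these heavily tied ordinal variables.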

Figure 2.5: Scatterplot of weight against age.

2.4 Exercises

2.1 Female psychiatric patients

1. Tabulate the mean weight change by level of depression.

2. By looping through the variables age, iq, and weight using foreach, tabulate the means and standard deviations for each of these variables by life.

3. Produce a bar chart analogous to the one in Figure 2.3 but for sex and life.

4. Use search nonparametric or search mann or search whitney to find help on how to run the Mann-Whitney U-test.

5. Compare the weight changes between the two groups using the Mann-Whitney U-test.

6. Form a scatterplot for iq and age using different symbols for the two groups (life=1 and life=2). Explore the use of the option jitter(#) for different integers # to stop symbols overlapping.

7. Find the command for the Spearman correlation coefficient and use it to find the Spearman correlation between age and iq.

8. Having tried out all these commands interactively, create a do-file containing these commands and run the do-file. In the graph commands, use the option saving(filename, replace) to save the graphs in the current directory and view the graphs later using the command graph use filename.

2.2 Australians going metric

Shortly after metric units of length were officially introduced in Australia, each of a group of 44 students was asked to guess, to the nearest meter, the width of the lecture hall in which they were sitting. Another group of 69 students in the same room was asked to guess the width in feet, to the nearest foot. The true width of the hall was 13.1 meters (43.0 feet). The data come from Hand et al. (1994).

The variables in meter.dta are:

■ id: student identifier
■ meters: dummy variable for guess being in meters
■ guess: guesses

1. Investigate, by the use of suitable graphics, significance tests and estimation procedures, whether there is any evidence of a systematic difference in the guesses made in meters and those made in feet.

2.3 Mortality from skin cancer

Here we consider a dataset from Belle et al. (2004) which consists of mortality rates due to malignant melanoma of the skin for white males during the period 1950-1969, for each state on the U.S. mainland.

The variables in mortality.dta are:

■ state: name of the state
■ mortality: mortality rate (in deaths per 10 million per year)
■ latitude: latitude of the center of each state
■ longitude: longitude of the center of the state
■ population: population (in millions)
■ ocean: dummy variable for state being contiguous with an ocean

1. Construct some suitable graphics for investigating how mortality is related to latitude and longitude and how any relationship between these variables is affected by being an ocean state.

2.4 Invasion of acacia trees by ants

The data in the 2 × 2 contingency table below (from Sokal and Rohlf, 1981) record the results of an experiment with acacia ants. All but 28 trees of two species of acacia (A and B) were cleared from an area in Central America, and these 28 trees were cleared of ants using insecticide. Sixteen colonies of a particular species of ant were obtained from other trees of species A. The colonies were placed roughly equidistant from the 28 trees and allowed to invade them.

1. Produce a table containing the percentages of trees invaded by tree type and the expected frequencies under the null hypothesis that there is no association between type of tree and invasion by ants. (Hint: use the tabi command to specify the frequencies within the Stata command instead of entering the data.)

2. Investigate whether there is any evidence that the invasion probability differs between the two species of acacia tree.

3. Obtain an approximate 99% confidence interval for the relevant difference in proportions. (Hint: use the csi command.)

2.5 Sexual satisfaction

Hout, Duncan and Sobel (1987) investigated the relative sexual satisfaction of married couples, by asking each member of 91 married couples to rate the degree to which they agreed with the statement "Sex is fun for me and my partner" on a four-point scale ranging from "never or occasionally" to "almost always". The data for 30 couples are given in satisfaction.dta. The variables are:

■ couple: couple identifier
■ husband: satisfaction score of husband
■ wife: satisfaction score of wife

1. Carry out an appropriate significance test to investigate whether there is any evidence that men and women differ in their mean sexual satisfaction.

2. Construct a crosstabulation of husband's and wife's sexual satisfaction and calculate a suitable measure of correlation between the two ratings.

3. Construct a 95% confidence interval for the true mean difference between the sexual satisfaction scores of husbands and wives.

2.6 Crowd reactions to threatened suicide

Mann (1981) conducted a study to investigate the causes of jeering or baiting behavior by a crowd when a person is threatening to commit suicide by jumping from a high building. The data given below result from the classification of threatened suicides by two factors, the time of year and whether or not baiting occurred.

                  Baiting   Non-baiting
June-September       8           4
October-May          2           7

1. A hypothesis is that baiting is more likely to occur in warm weather. (The data come from the northern hemisphere, so June-September are the warm months.) Produce a table of counts and percentages for assessing this hypothesis. (You can use the tabi command which allows you to list the cell frequencies in the command, rather than entering them as a dataset.)

2. Produce a table of expected frequencies under the null hypothesis that there is no association between season and baiting behavior.

3. Test the null hypothesis that there is no association between season and baiting behavior.
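Several of the exercises above suggest tabi, the immediate form of tabulate. As a minimal sketch of its syntax (an added illustration, not a solution to any exercise), the cell counts of Display 2.4 (12 and 38 in the first row, 5 and 58 in the second) can be typed directly into the command, with rows separated by a backslash. The options are the same as those of tabulate for two-way tables.

```stata
* Immediate two-way table: cell counts typed in, rows separated by \
* row      : row percentages
* expected : expected counts under the null hypothesis of no association
* chi2, exact : Pearson chi-squared test and Fisher's exact test
tabi 12 38 \ 5 58, row expected chi2 exact
```

The test results should match those shown in Display 2.4, since the same counts are being analyzed.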


Chapter 3

Multiple Regression: Determinants of Pollution in U.S. Cities

3.1 Description of data

Data on air pollution in 41 U.S. cities were collected by Sokal and Rohlf (1981) from several U.S. government publications and are reproduced here in Table 3.1. (The data are also given in Hand et al., 1994.) There is a single dependent variable, so2, the annual mean concentration of sulphur dioxide, in micrograms per cubic meter. These data are means for the three years 1969 to 1971 for each city. The values of six explanatory variables, two of which concern human ecology and four climate, are also recorded; details are as follows:

■ temp: average annual temperature in °F
■ manuf: number of manufacturing enterprises employing 20 or more workers
■ pop: population size (1970 census) in thousands
■ wind: average annual wind speed in miles per hour
■ precip: average annual precipitation in inches
■ days: average number of days with precipitation per year

The main question of interest about these data is how the pollution level as measured by sulphur dioxide concentration is determined by the six explanatory variables. The central method of analysis will be multiple regression.

Table 3.1 Data in usair.dat

Town             SO2   temp  manuf    pop   wind  precip  days
Phoenix           10   70.3    213    582    6.0    7.05    36
Little Rock       13   61.0     91    132    8.2   48.52   100
San Francisco     12   56.7    453    716    8.7   20.66    67
Denver            17   51.9    454    515    9.0   12.95    86
Hartford          56   49.1    412    158    9.0   43.37   127
Wilmington        36   54.0     80     80    9.0   40.25   114
Washington        29   57.3    434    757    9.3   38.89   111
Jackson           14   68.4    136    529    8.8   54.47   116
Miami             10   75.5    207    335    9.0   59.80   128
Atlanta           24   61.5    368    497    9.1   48.34   115
Chicago          110   50.6   3344   3369   10.4   34.44   122
Indianapolis      28   52.3    361    746    9.7   38.74   121
Des Moines        17   49.0    104    201   11.2   30.85   103
Wichita            8   56.6    125    277   12.7   30.58    82
Louisville        30   55.6    291    593    8.3   43.11   123
New Orleans        9   68.3    204    361    8.4   56.77   113
Baltimore         47   55.0    625    905    9.6   41.31   111
Detroit           35   49.9   1064   1513   10.1   30.96   129
Minneapolis       29   43.5    699    744   10.6   25.94   137
Kansas            14   54.5    381    507   10.0   37.00    99
St. Louis         56   55.9    775    622    9.5   35.89   105
Omaha             14   51.5    181    347   10.9   30.18    98
Albuquerque       11   56.8     46    244    8.9    7.77    58
Albany            46   47.6     44    116    8.8   33.36   135
Buffalo           11   47.1    391    463   12.4   36.11   166
Cincinnati        23   54.0    462    453    7.1   39.04   132
Cleveland         65   49.7   1007    751   10.9   34.99   155
Columbia          26   51.5    266    540    8.6   37.01   134
Philadelphia      69   54.6   1692   1950    9.6   39.93   115
Pittsburgh        61   50.4    347    520    9.4   36.22   147
Providence        94   50.0    343    179   10.6   42.75   125
Memphis           10   61.6    337    624    9.2   49.10   105
Nashville         18   59.4    275    448    7.9   46.00   119
Dallas             9   66.2    641    844   10.9   35.94    78
Houston           10   68.9    721   1233   10.8   48.19   103
Salt Lake City    28   51.0    137    176    8.7   15.17    89
Norfolk           31   59.3     96    308   10.6   44.68   116
Richmond          26   57.8    197    299    7.6   42.59   115
Seattle           29   51.1    379    531    9.4   38.79   164
Charleston        31   55.2     35     71    6.5   40.75   148
Milwaukee         16   45.7    569    717   11.8   29.07   123

3.2 The multiple regression model

The multiple regression model has the general form

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i \qquad (3.1)$$

where $y_i$ is a continuous response (or dependent) variable for the $i$th member of the sample, $x_{1i}, x_{2i}, \ldots, x_{pi}$ are a set of explanatory (or independent) variables or covariates, $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are regression coefficients, and $\epsilon_i$ is a residual or error term with zero mean that is uncorrelated with the explanatory variables. It follows that the expected value of the response for given values of the covariates is

$$E(y_i \mid \mathbf{x}_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi},$$

where $\mathbf{x}_i' = (x_{1i}, \ldots, x_{pi})$. This is also the value we would predict for a new individual with covariate values $\mathbf{x}_i$ if we knew the regression coefficients.

Each regression coefficient represents the mean change in the response variable when the corresponding explanatory variable increases by one unit and all other explanatory variables remain constant. The coefficients therefore represent the effects of each explanatory variable, controlling for all other explanatory variables in the model, giving rise to the term "partial" regression coefficients. The residual is the difference between the observed value of the response and the expected value based on the explanatory variables.

The regression coefficients $\beta_0, \ldots, \beta_p$ are generally estimated by least squares; in other words the estimates $\hat\beta_0, \ldots, \hat\beta_p$ minimize the sum of the squared differences between observed and predicted responses, or the sum of squared estimated residuals,

$$\sum_{i=1}^{n} \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_{1i} + \cdots + \hat\beta_p x_{pi}) \right]^2. \qquad (3.2)$$

Significance tests for the regression coefficients can be derived by assuming that the error terms are independently normally distributed with zero mean and constant variance $\sigma^2$.

For $n$ observations of the response and explanatory variables, the regression model may be written concisely as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},$$

where $\mathbf{y}$ is the $n \times 1$ vector of responses, $\mathbf{X}$ is an $n \times (p + 1)$ matrix of known constants, the first column containing a series of ones corresponding to the term $\beta_0$ in (3.1), and the remaining columns values of the explanatory variables. The elements of the vector $\boldsymbol{\beta}$ are the regression coefficients $\beta_0, \ldots, \beta_p$, and those of the vector $\boldsymbol{\epsilon}$, the error terms $\epsilon_1, \ldots, \epsilon_n$. The least squares estimates of the regression coefficients can then be written as

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y},$$

and the variances and covariances of these estimates can be found from

$$\widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}}) = s^2 (\mathbf{X}'\mathbf{X})^{-1},$$

where $s^2$ is an estimate of the residual variance $\sigma^2$ given by the sum of squared estimated residuals in equation (3.2) divided by $n - p - 1$.

The coefficient of determination, $R^2$, represents the proportion of the total variance of the response variable that is explained by the explanatory variables. Alternatively, it can be interpreted as the proportional reduction in prediction error variance of the model compared with the constant-only model (without covariates). $R$, also known as the multiple correlation coefficient, is just the correlation between the observed responses $y_i$ and the predicted responses $\hat{y}_i$. For full details of multiple regression see, for example, Rawlings et al. (1998).
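The matrix formulas of this section can be checked numerically. The sketch below is not from the original text: it assumes Stata's shipped example dataset auto.dta and the variables price, mpg, and weight purely as stand-ins, fits the model with regress, and then recomputes the coefficients and standard errors in Mata from the least squares formulas. It is meant to be run from a do-file.

```stata
* Stand-in data and variables (assumptions for illustration only)
sysuse auto, clear
regress price mpg weight

mata:
y = st_data(., "price")              // n x 1 vector of responses
X = st_data(., ("mpg", "weight"))    // n x p matrix of explanatory variables
X = X, J(rows(X), 1, 1)              // append a column of ones for the constant
b = invsym(X'X) * X'y                // least squares estimates
e = y - X*b                          // estimated residuals
s2 = (e'e) / (rows(X) - cols(X))     // residual variance, divisor n - p - 1
V = s2 * invsym(X'X)                 // estimated variances and covariances
(b, sqrt(diagonal(V)))               // coefficients next to their standard errors
end
```

The column of ones is placed last here rather than first only so that the rows of the result line up with the order in which regress reports its coefficients (constant at the bottom).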

3.3 Analysis using Stata

Assuming the data are available as an ASCII file usair.dat in the current directory and that the file contains city names (abbreviated versions of those in Table 3.1), they may be read in for analysis using the following instruction:

infile str10 town so2 temp manuf pop ///
    wind precip days using usair.dat, clear

Here we had to declare the "type" for the string variable town as str10 which stands for "string variable with 10 characters".

Before undertaking a formal regression analysis of these data, it will be helpful to examine them graphically using a scatterplot matrix. Such a display is useful for assessing the general relationships between the variables, for identifying possible outliers, and for highlighting potential collinearity problems amongst the explanatory variables. The basic plot can be obtained using

graph matrix so2 temp manuf pop wind precip days


Figure 3.1: Scatterplot matrix.

The resulting diagram is shown in Figure 3.1. Several of the scatterplots show evidence of outliers, and the relationship between manuf and pop is very strong, suggesting that using both as explanatory variables in a regression analysis may lead to problems (see later). The relationships of particular interest, namely those between so2 and the explanatory variables (the relevant scatterplots are those in the first row of Figure 3.1), indicate some possible nonlinearity. A more informative, although slightly more "messy" diagram can be obtained if the plotted points are labeled with the associated town names. We first create a variable containing the first three characters of the strings in town using the function substr()

generate twn = substr(town,1,3)

We then create a scatterplot matrix with these three-character town labels using

graph matrix so2-days, msymbol(none) mlabel(twn) ///
    mlabposition(0)

The mlabel() option labels the points with the names in the twn variable. By default, a "marker symbol" would also be plotted and this can be suppressed using msymbol(none); mlabposition(0) centers the labels where the symbol would normally go. The resulting diagram appears in Figure 3.2. Clearly, Chicago and to a lesser extent Philadelphia might be considered outliers. Chicago has such a high degree of pollution compared with the other cities that it should perhaps be considered as a special case and excluded from further analysis. We can remove Chicago using

drop if town=="Chicago"

The command regress may be used to fit a basic multiple regression model. The necessary Stata command for regressing sulphur dioxide concentration on the six explanatory variables is

regress so2 temp manuf pop wind precip days

or, alternatively,

regress so2 temp-days

(see Display 3.1).

      Source |        SS    df          MS         Number of obs =      40
-------------+------------------------------       F(  6,    33) =    6.20
                                                   Prob > F      =  0.0002
                                                   R-squared     =  0.5297
-------------+------------------------------       Adj R-squared =  0.4442
       Total |   15485.9    39  397.074359         Root MSE      =  14.855

         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        temp |  -1.268452   .6305269    -2.01   0.052    -2.551266    .0143631
       manuf |   .0654927   .0181777     3.60   0.001     .0285098    .1024755
         pop |   -.039431   .0155342    -2.54   0.016    -.0710357   -.0078264
        wind |  -3.198267   1.859713    -1.72   0.095    -6.981881    .5853489
      precip |   .5136846   .3687273     1.39   0.173    -.2364966    1.263866
        days |  -.0532051   .1653576    -0.32   0.750    -.3896277    .2832175
       _cons |   111.8709   48.07439     2.33   0.026     14.06278     209.679

Display 3.1

The main features of interest in the output in Display 3.1 are the analysis of variance table and the parameter estimates. In the former, the ratio of the model mean square to the residual mean square gives an F-test for the hypothesis that all the regression coefficients in the fitted model are zero (except the constant $\beta_0$). The resulting F-statistic with 6 and 33 degrees of freedom takes the value 6.20 and is shown on the right-hand side; the associated p-value is very small. Consequently, the hypothesis is rejected. The square of the multiple correlation coefficient ($R^2$) is 0.53, showing that 53% of the variance of sulphur dioxide concentration is accounted for by the six explanatory variables of interest.

The adjusted $R^2$ statistic is an estimate of the population $R^2$ taking account of the fact that the parameters were estimated from the same data for which $R^2$ is evaluated. The statistic is calculated as

$$\text{adj } R^2 = 1 - \frac{(n - 1)(1 - R^2)}{n - p - 1},$$

where $n$ is the number of observations used in fitting the model and $p$ is the number of explanatory variables. The root MSE is simply the square root of the residual mean square in the analysis of variance table, which itself is an estimate of the parameter $\sigma^2$. The estimated regression coefficients give the estimated change in the mean of the response variable produced by a unit change in the corresponding explanatory variable with the remaining explanatory variables held constant.
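The summary statistics of Display 3.1 can be reproduced by hand from the formulas given here. The following sketch is an added illustration using display as a calculator, with n = 40, p = 6, and the rounded values R-squared = 0.5297, F = 6.20, and Root MSE = 14.855. Because the inputs are rounded, the results agree with the display only approximately. The trailing // comments assume the lines are run from a do-file.

```stata
* Adjusted R-squared from R-squared, n, and p; about 0.444
display 1 - (40 - 1)*(1 - 0.5297)/(40 - 6 - 1)

* F-statistic implied by R-squared; close to the reported 6.20
display (0.5297/6)/((1 - 0.5297)/(40 - 6 - 1))

* Upper-tail probability of F = 6.20 on 6 and 33 degrees of freedom
display Ftail(6, 33, 6.20)

* Residual mean square implied by the root MSE
display 14.855^2
```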

One concern generated by the initial graphical material on these data was the strong relationship between the two explanatory variables manuf and pop. The correlation of these two variables is obtained by using

correlate manuf pop

(obs=40)

             |    manuf      pop
-------------+------------------
       manuf |   1.0000
         pop |   0.8906   1.0000

The strong linear dependence might be a source of collinearity problems and can be investigated further by calculating what are known as variance inflation factors for each of the explanatory variables. These are given by

$$\mathrm{VIF}(x_k) = \frac{1}{1 - R_k^2},$$

where $\mathrm{VIF}(x_k)$ is the variance inflation factor for explanatory variable $x_k$, and $R_k^2$ is the square of the multiple correlation coefficient obtained from regressing $x_k$ on the remaining explanatory variables. The variance inflation factor represents the squared standard error (or sampling variance) of $\hat\beta_k$ in the estimated model divided by the squared standard error that would be obtained if $x_k$ were uncorrelated with the remaining variables.
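The definition can be verified by running the auxiliary regression directly. The sketch below is an added illustration that assumes Stata's shipped auto.dta and the variables price, mpg, weight, and length as stand-ins. It compares the variance inflation factor reported by estat vif with 1/(1 - R-squared) from regressing one explanatory variable on the others.

```stata
* Stand-in data and variables (assumptions for illustration only)
sysuse auto, clear
regress price mpg weight length
estat vif

* Auxiliary regression of weight on the remaining explanatory variables
regress weight mpg length

* e(r2) holds the R-squared of the last regression;
* the result should match the VIF reported for weight above
display 1/(1 - e(r2))
```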

The variance inflation factors can be obtained using the estat vif command after regress:

estat vif

(see Display 3.2).

    Variable |       VIF       1/VIF
-------------+----------------------
             |              0.269158
             |              0.287862
             |              0.293125
        wind |      1.26    0.790614
-------------+----------------------
    Mean VIF |      4.05

Display 3.2

Chatterjee et al. (2006) give the following "rules-of-thumb" for evaluating these factors:

■ Values larger than 10 give evidence of collinearity.
■ A mean of the VIF factors considerably larger than one suggests collinearity.

Here there are no values greater than 10 (as an exercise we suggest readers also calculate the VIFs when the observations for Chicago are included), but the mean value of 4.05 gives some cause for concern. A simple (although not necessarily the best) way to proceed is to drop one of manuf or pop. Another possibility is to replace manuf by a new variable equal to manuf divided by pop, representing the number of large manufacturing enterprises per thousand inhabitants (see Exercise 3.1). However, we shall simply exclude manuf and repeat the regression analysis using the five remaining explanatory variables:

regress so2 temp pop wind precip days

The output is shown in Display 3.3. Now recompute the variance inflation factors:

estat vif

The variance inflation factors in Display 3.4 are now satisfactory.

Number of obs =      40
F(  5,    34) =    3.58
Prob > F      =  0.0105
R-squared     =  0.3448
Adj R-squared =  0.2484
Root MSE      =  17.275

         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        temp |  -1.867665   .7072827    -2.64   0.012    -3.305037    -.430294
         pop |   .0113969   .0075627     1.51   0.141    -.0039724    .0267661
        wind |  -3.126429    2.16257    -1.45   0.157      -7.5213    1.268443
      precip |   .6021108   .4278489     1.41   0.168    -.2673827    1.471604
        days |   -.020149   .1920032    -0.10   0.917    -.4103424    .3700445
       _cons |   135.8565   55.36797     2.45   0.019     23.33629    248.3778

Display 3.3

    Variable |       VIF       1/VIF
-------------+----------------------
             |      3.46    0.289202
             |      3.40    0.294429
             |      1.26    0.790710
             |      1.07    0.931015
-------------+----------------------
    Mean VIF |      2.53

Display 3.4

The very general hypothesis concerning all regression coefficients mentioned previously is not usually of great interest in most applications of multiple regression because it is most unlikely that all the chosen explanatory variables will be unrelated to the response variable. The more interesting question is whether a subset of the regression coefficients is zero, implying that not all the explanatory variables are of use in predicting the response variable. A preliminary assessment of the likely importance of each explanatory variable can be made using the table of estimated regression coefficients and associated statistics. Using a conventional 5% criterion, the only "significant" coefficient is that for the variable temp. Unfortunately, this very simple approach is not in general suitable, since in most cases the explanatory variables are correlated, and the t-tests will not be independent of each other. Consequently, removing a particular variable from the regression will alter both the estimated regression coefficients of the remaining variables and their standard errors. A more involved approach to identifying important subsets of explanatory variables is therefore required. A number of procedures are available.

1. Confirmatory approach: A small set of explanatory variables are included as suggested by substantive theory, or to allow testing of particular a priori hypotheses. The model is typically modified somewhat by removing some variables, considering interactions, etc. to achieve a better fit to the data.

2. Exploratory approach: Automatic selection methods, which are of the following types:

   a. Forward selection: This method starts with a model containing none of the explanatory variables and then considers variables one by one for inclusion. At each step, the variable added is the one that results in the biggest increase in the model or regression sum of squares. An F-type statistic is used to judge when further additions would not represent a significant improvement in the model.

   b. Backward elimination: Here variables are considered for removal from an initial model containing all the explanatory variables. At each stage, the variable chosen for exclusion is the one leading to the smallest reduction in the regression sum of squares. Again, an F-type statistic is used to judge when further exclusions would represent a significant deterioration in the model.

   c. Stepwise regression: This method is essentially a combination of the previous two. The forward selection procedure is used to add variables to an existing model and, after each addition, a backward elimination step is introduced to assess whether variables entered earlier might now be removed because they no longer contribute significantly to the model.

It is clear that the automatic selection methods are based on a large number of significance tests, one for each variable considered for inclusion or exclusion in each step. It is well known that the probability of a false positive result or Type I error increases with the number of tests. The chosen model should therefore be interpreted with extreme caution, particularly if there were a large number of candidate variables. Another problem with the three automatic procedures is that they often do not lead to the same model; see also Harrell (2001) for a discussion of model selection strategies. Although we would generally not recommend automatic procedures, we will use them here for illustration.
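The growth of the false positive rate with the number of tests can be made concrete with a small calculation. The sketch below is an added illustration under a simplifying assumption that does not hold for stepwise procedures, namely that the tests are independent. In that case the chance of at least one false positive among k tests at the 5% level is 1 - (1 - 0.05)^k.

```stata
* Chance of at least one Type I error among k independent tests at the 5% level
* (independence is an assumption made only for this illustration)
foreach k in 1 5 10 20 {
    display "k = `k': " %5.3f (1 - (1 - 0.05)^`k')
}
```

With ten tests the probability is already about 0.40, which is why a model chosen by automatic selection should not be read as if each retained variable had passed a single 5% test.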

First we will take a confirmatory approach to investigate if climate (temp, wind, precip, days) or human ecology (pop) or both are important predictors of air pollution. We treat these groups of variables as single terms, allowing either all variables in a group to be included or none. This can be done by enclosing the variables in parentheses in the following command:

stepwise, pe(0.05): regress so2 (temp wind precip days) (pop)

(see Display 3.5). Here the prefix command stepwise is used with the pe() ("probability to enter") option to indicate that forward selection should be used with a significance level of 0.05; terms with a p-value less than 0.05 will be included. Here, only the climate variables are shown since they are jointly significant (p = 0.0119) using an F-test.

                      begin with empty model
p = 0.0119 <  0.0500  adding  temp wind precip days

Number of obs =      40
F(  4,    35) =    3.77
Prob > F      =  0.0119
R-squared     =  0.3010
Adj R-squared =  0.2211
Root MSE      =  17.586

Display 3.5
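Because the four climate variables enter an empty model as a single term, the F-test used for entry appears to coincide with the overall F-test of the fitted model, which is why the same p-value of 0.0119 shows up twice in Display 3.5. As an added check, using display as a calculator with the rounded values from the display (R-squared = 0.3010, four variables, n = 40):

```stata
* F-statistic implied by R-squared for 4 variables and 40 observations; about 3.77
display (0.3010/4)/((1 - 0.3010)/(40 - 4 - 1))

* Upper-tail probability of F = 3.77 on 4 and 35 degrees of freedom;
* should be close to the reported 0.0119
display Ftail(4, 35, 3.77)
```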

As a further illustration of automatic selection procedures, the following Stata instruction applies the backward elimination method, with explanatory variables whose F-values for removal have associated p-values greater than 0.2 being removed:

stepwise, pr(0.2): regress so2 temp pop wind precip days

(see Display 3.6). Here, the pr() option indicates that backward selection should be used with a "probability to remove" of 0.2. With this significance level, only the variable days is excluded.

                      begin with full model
p = 0.9170 >= 0.2000  removing days

      Source |        SS    df          MS         Number of obs =      40
-------------+------------------------------       F(  4,    35) =    4.60
       Model |                     1333.937        Prob > F      =  0.0043
                                                   R-squared     =  0.3446
-------------+------------------------------       Adj R-squared =  0.2698
       Total |   15485.9    39  397.074359         Root MSE      =   17.03

         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        temp |  -1.810123   .4404001    -4.11   0.000    -2.704183   -.9160635
         pop |   .0113089   .0074091     1.53   0.136    -.0037313    .0263501
        wind |  -3.085281   2.096471    -1.47   0.150    -7.341347    1.170778
      precip |   .5660172   .2508601     2.26   0.030     .0567441     1.07528
       _cons |   131.3388   34.32034     3.83   0.001     61.66458    201.0126

Display 3.6

The next stage iin t h e anolyYis bhoulcl be ,ul examination of the .~afdunlu from the chmn model; that is, the differences brtween t,he " b w r t d and fitted wluw of sulphllr (jioxirlc concentration. Stdl a

2mct.rlure is uital for aswising rnotlel ays~~rn@,ions, identifying any un- >ual fe~tures in the dntJ.t.a indicating outlicrs: and suggesting possibly ;in?ylifying transfor~nxtions. The mmt nsdui wnys of sxarnir~ing the :-dluals are grapl~ir:d, and the mmt mrnmonly used ploCs are a fol-

w5:

m A plot uf the residuds agains? ewh explanatory wriahlo in the model. The presence of a curvilinear relationship. Tor example, wo111cl sugpst that a higher-order leml. perhaps s qudrntic ~ I L

the explnuatory \xrinhlc. shuulcl be a d d 4 to the model.

B A plot of t h ~ residuals tlgilinst predicted claluei of the rpsponse

i mrial>lc. Lf the varimre d thr residuds appcms to i~~crease or dtercnw with the predicted rduc. a Iru~sfnrmntio~l of thc rGponsp m;lv be ill ordcr.

Page 82: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Figure 3.3: Residuals against predicted response.

• A normal probability plot of the residuals. After all systematic variation has been removed from the data, the residuals should look like a sample from the normal distribution. A plot of the ordered residuals against the expected order statistics from a normal distribution (with mean and variance equal to the sample estimates) provides a graphical check on this assumption.

The first two plots can be obtained after estimation with the regress command using the rvpplot ("residual versus predictor") and rvfplot ("residual versus fitted") instructions. For example, for the model chosen by the backward selection procedure, a plot of residuals against predicted values with the first three letters of the town name used to label the points is obtained using the command

rvfplot, mlabel(twn)

The resulting plot is shown in Figure 3.3 and indicates a possible problem, namely the apparently increasing variance of the residuals as the fitted values increase (see also Chapter 7). Perhaps some thought needs to be given to the possible transformations of the response variable (see Exercise 3.1).

Next, graphs of the residuals plotted against each of the four explanatory variables can be obtained using the following foreach loop:

foreach x in pop temp wind precip {
    rvpplot `x', mlabel(twn)
    more
}

Page 83: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Figure 3.4: Residuals against population size.

Here more causes Stata to pause after each graph has been plotted until the user presses any key. The resulting graphs are shown in Figures 3.4 to 3.7. In each graph the point corresponding to the town Providence is somewhat distant from the bulk of the points, and the graph for wind has perhaps a "hint" of a curvilinear structure. Note that the appearance of these graphs could be improved using the mlabvpos(varname) option to specify the "clock positions" (e.g., 12 is straight above) of the labels relative to the points.

The simple residuals plotted by rvfplot and rvpplot have a distribution that is scale dependent because the variance of each is a function of both $\sigma^2$ and the diagonal values of the so-called "hat" matrix, $H$, given by

$$H = X(X'X)^{-1}X'$$

(see Cook and Weisberg (1982) for a full explanation of the hat matrix). Consequently, it is often more useful to work with a standardized version

Page 84: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Figure 3.5: Residuals against temperature.

Figure 3.6: Residuals against wind speed.

Page 85: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

$$r_i = \frac{y_i - \hat{y}_i}{s\sqrt{1 - h_{ii}}}$$

where $s^2$ is the estimate of $\sigma^2$, $\hat{y}_i$ is the predicted value of the response, and $h_{ii}$ is the $i$th diagonal element of $H$.
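To make the roles of $s^2$ and $h_{ii}$ concrete, here is a minimal Python sketch (not Stata, and not the handbook's own code) for the simplest case of a single explanatory variable, where the diagonal of the hat matrix has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The data values and the function name `standardized_residuals` are hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Return (standardized residuals, leverages) for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    # Raw residuals: observed minus fitted values
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # s^2 estimates sigma^2 using n - 2 residual degrees of freedom
    s2 = sum(e * e for e in resid) / (n - 2)
    # Diagonal elements of the hat matrix (leverages)
    lever = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    std = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, lever)]
    return std, lever

# Hypothetical data; the last point lies far from the others in x
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [2.0, 4.0, 5.0, 4.0, 12.0]
std_res, leverage = standardized_residuals(x, y)
```

The leverages sum to the number of fitted parameters (two here), and the isolated point has by far the largest leverage, which is why its raw residual alone would understate how unusual it is.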

These standardized residuals can be obtained using the predict command. For example, to obtain a normal probability plot of the standardized residuals and to plot them against the fitted values requires the following instructions:

predict fit
predict sdres, rstandard
pnorm sdres
twoway scatter sdres fit, mlabel(twn)

The first instruction stores the fitted values in the variable fit, the second stores the standardized residuals in the variable sdres, the third produces a normal probability plot (Figure 3.8), and the last instruction produces the graph of standardized residuals against fitted values, which is shown in Figure 3.9.

The normal probability plot indicates that the distribution of the residuals departs somewhat from normality. The pattern in the plot shown in Figure 3.9 is very similar to that in Figure 3.3, but here values outside (-2, 2) indicate possible outliers, in this case the point corresponding to the town Providence. Analogous plots to those in Figures 3.4 to 3.7 could be obtained in the same way.

A rich variety of other diagnostics for investigating fitted regression models has been developed and many of these are available after estimation with the regress procedure (see help regress postestimation). Illustrated here is the use of two of these, namely the partial residual plot (Mallows, 1973) and Cook's distance (Cook, 1977, 1979). The former are useful in identifying whether, for example, quadratic or higher order terms are needed for any of the explanatory variables; the latter measures the change to the estimates of the regression coefficients that results from deleting each observation and can be used to indicate those observations that may be having an undue influence on the estimates.

Page 86: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Figure 3.8: Normal probability plot of standardized residuals.

Figure 3.9: Standardized residuals against predicted values.

Page 87: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

The partial residual plots are obtained using the cprplot ("component plus residual") command. For the four explanatory variables in the selected model for the pollution data, the required plots are obtained as follows:

foreach x in pop temp wind precip {
    cprplot `x', lowess
    more
}

The lowess option produces a locally weighted regression curve or lowess. The resulting graphs are shown in Figures 3.10 to 3.13. The graphs have to be examined for nonlinearities and for assessing whether the regression line, which has slope equal to the estimated effect of the corresponding explanatory variable in the chosen model, fits the data adequately. The added lowess curve is generally helpful for both. None of the four graphs gives any obvious indication of nonlinearity.

The Cook's distances are found using the predict command with the cooksd option; the following calculates these statistics for the chosen model for the pollution data and lists the observations where the statistic is greater than 4/40 (4/n), which is usually the value regarded as indicating possible problems.

predict cook, cooksd
list town so2 cook if cook>4/40

        town     so2       cook
  1.    Phoenix   10   .2643286
 28.    Philad    69   .3646437
 30.    Provid    94   .2839334

The first instruction stores the Cook's distance statistics in the variable cook, and the second lists details of those observations for which the statistic is above the suggested cut-off point.
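The 4/n rule is easy to apply by hand once the distances are available. The Python sketch below (an illustration outside Stata, not what predict does internally) uses the standard expression of Cook's distance in terms of a standardized residual $r_i$, a leverage $h_{ii}$, and the number of estimated coefficients $p$, namely $D_i = r_i^2 h_{ii} / \{p(1 - h_{ii})\}$. The function names and the numerical inputs are hypothetical.

```python
def cooks_distance(std_resid, leverage, n_params):
    """Cook's distance from a standardized residual and a hat-matrix diagonal."""
    return (std_resid ** 2 / n_params) * leverage / (1 - leverage)

def flag_influential(distances, n_obs):
    """Indices of observations whose Cook's distance exceeds 4/n."""
    cutoff = 4 / n_obs
    return [i for i, d in enumerate(distances) if d > cutoff]

# Hypothetical case: standardized residual 2.0, leverage 0.2, and five
# coefficients (four slopes plus the constant, as in the chosen model)
d_example = cooks_distance(2.0, 0.2, 5)

# Hypothetical distances screened against the cut-off for n = 40
flagged = flag_influential([0.02, d_example, 0.09, 0.11], n_obs=40)
```

With 40 observations the cut-off is 0.1, so a moderately large residual combined with moderate leverage is already enough for an observation to be listed.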

There are three influential observations. Several of the diagnostic procedures used previously also suggest these observations as possibly giving rise to problems, and some consideration should be given to repeating the analyses with these three observations removed in addition to the initial removal of Chicago.

Page 88: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Figure 3.10: Partial residual plot for population size.

Figure 3.11: Partial residual plot for temperature.

Page 89: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Figure 3.12: Partial residual plot for wind speed.

Figure 3.13: Partial residual plot for precipitation.

Page 90: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

3.4 Exercises

3.1 Determinants of pollution in U.S. cities

1. Repeat the analyses described in this chapter after removing the three possible outlying observations identified by Cook's distances.

2. The solution to the high correlation between the variables manuf and pop adopted in the chapter was simply to remove the former. Investigate other possibilities such as defining a new variable manuf/pop in addition to pop to be used in the regression analysis.

3. Consider the possibility of taking a transformation of sulphur dioxide pollution before undertaking any regression analyses. For example, try a log transformation.

4. Explore the use of the many other diagnostic procedures available with the regress procedure.

See also Exercises in Chapter 14.

3.2 Extroversion and car care

Miles and Shevlin (2001) describe a dataset collected in an investigation of how people project their self-image through objects they own, in this case their cars. The main question is how a person's extroversion affects the amount of time spent looking after his or her car. But since it is known that extroversion is related to both gender and age, the latter two variables need to be controlled for.

The variables in extroversion.dta are:

• sex: sex of respondent (0=female, 1=male)
• age: age (in years)
• ex: extroversion score
• car: time respondent spends looking after car (in minutes per week)

1. Fit a suitable regression model to address the main research question stated above.

2. Interpret the estimated regression coefficients.

3. Perform some residual diagnostics.

Page 91: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

3.3 Mortality from skin cancer

1. For the malignant melanoma data given in Exercise 2.3 of the previous chapter, fit the multiple regression model of mortality on latitude, longitude, population size, and ocean state.

2. Try to find a more parsimonious model (one with fewer explanatory variables) that fits the data adequately.

3. Investigate the assumptions of the model by constructing suitable residual plots or other diagnostic plots.

3.4 Water hardness

Data were collected on 61 large towns in England and Wales to investigate the environmental causes of disease (see Hand et al., 1994). Here we consider the annual mortality per 100,000 for males, averaged over the years 1958-1964, and the calcium concentration in parts per million in the drinking water supply. (The higher the calcium concentration, the harder the water.) Towns at least as far north as Derby are considered northern towns.

The variables in water.dta are:

• town: string variable (N=northern town, S=southern town)
• mortality: mortality per 100,000 for males per year
• calcium: calcium concentration in parts per million

1. How are mortality and water hardness related, and is there a geographical factor in the relationship?

2. For your chosen regression model, plot predicted mortality versus calcium concentration with separate regression lines for northern and southern towns.

3. Superimpose LOWESS curves onto the predicted regression lines. Which assumptions does this graph allow you to assess?

Page 92: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Chapter 4

Analysis of Variance I: Treating Hypertension

4.1 Description of data

Maxwell and Delaney (1990) describe a study in which the effects of three possible treatments for hypertension were investigated. The details of the treatments are as follows:

Treatment   Description    Levels
drug        medication     drug X, drug Y, drug Z
biofeed     biofeedback    present, absent
diet        special diet   present, absent

All 12 combinations of the three treatments were included in a 3 × 2 × 2 design. Seventy-two subjects suffering from hypertension were recruited, and six were allocated randomly to each combination of treatments. Blood pressure measurements were made on each subject leading to the data shown in Table 4.1. Questions of interest concern differences in mean blood pressure for the different levels of the three treatments and the effects of interactions between the treatments on blood pressure.

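4.2 Analysis of variance model

A suitable model for these data is

$$y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + \epsilon_{ijkl}$$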

Page 93: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf


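Table 4.1 Data in bp.raw

                    Biofeedback                No biofeedback
              drug X   drug Y   drug Z    drug X   drug Y   drug Z
Diet absent
                170      186      180       173      189      202
                175      194      187       194      194      228
                165      201      199       197      217      190
                180      215      170       190      206      206
                160      219      204       176      199      224
                158      209      194       198      195      204
Diet present
                161      164      162       164      171      205
                173      166      184       190      173      199
                157      159      183       169      196      170
                152      182      156       164      199      160
                181      187      180       176      180      179
                190      174      173       175      203      179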

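where $y_{ijkl}$ represents the blood pressure of the $l$th subject for the $i$th drug, the $j$th level of biofeedback, and the $k$th level of diet, $\mu$ is the overall mean, $\alpha_i$, $\beta_j$, and $\gamma_k$ are the main effects for drugs, biofeedback, and diets, $(\alpha\beta)_{ij}$, $(\alpha\gamma)_{ik}$, and $(\beta\gamma)_{jk}$ are the first-order interaction terms, $(\alpha\beta\gamma)_{ijk}$ is a second-order interaction term, and $\epsilon_{ijkl}$ are the residual or error terms assumed to be normally distributed with zero mean and variance $\sigma^2$.

To identify the model, some constraints have to be imposed on the parameters. The standard constraints are:

$$\sum_i \alpha_i = \sum_j \beta_j = \sum_k \gamma_k = 0,$$

$$\sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = \sum_i (\alpha\gamma)_{ik} = \sum_k (\alpha\gamma)_{ik} = \sum_j (\beta\gamma)_{jk} = \sum_k (\beta\gamma)_{jk} = 0,$$

$$\sum_i (\alpha\beta\gamma)_{ijk} = \sum_j (\alpha\beta\gamma)_{ijk} = \sum_k (\alpha\beta\gamma)_{ijk} = 0.$$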

Page 94: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

We can test the following null hypotheses:

$H_0^{(1)}$: No drug effect: $\alpha_1 = \alpha_2 = \alpha_3 = 0$

$H_0^{(2)}$: No biofeedback effect: $\beta_1 = \beta_2 = 0$

$H_0^{(3)}$: No drug by biofeedback interaction: $(\alpha\beta)_{ij} = 0$; $i = 1, 2, 3$; $j = 1, 2$

Since there are an equal number of observations in each cell of Table 4.1, the total variation in the responses can be partitioned into non-overlapping parts (an orthogonal partition) representing main effects and interactions and residual variation, as shown in Figure 4.1. F-tests can then be constructed for each hypothesis described above. More details can be found in Everitt (2001).

4.3 Analysis using Stata

Assuming the data are in an ASCII file bp.raw, exactly as shown in Table 4.1, i.e., 12 rows, the first containing the observations 170 186 180 173 189 202, they can be read into Stata by producing a dictionary file bp.dct containing the following statements:

dictionary using bp.raw {
    _column(6)  int bp11
    _column(14) int bp12
    _column(22) int bp13
    _column(30) int bp01
    _column(38) int bp02
    _column(46) int bp03
}

Page 95: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Source                   DF                   MS
Model                                         (between cells)
  A                      a-1                  MSA
  B                      b-1                  MSB
  C                      c-1                  MSC
  AB                     (a-1)(b-1)           MSAB
  AC                     (a-1)(c-1)           MSAC
  BC                     (b-1)(c-1)           MSBC
  ABC                    (a-1)(b-1)(c-1)      MSABC
Error (Residual)         abc(n-1)             MSE (within cells)
Total

Figure 4.1: ANOVA table for three-way factorial design with factors A, B, and C having a, b, and c levels, respectively, and with n observations per cell.

and using the following command

infile using bp, clear

Note that it was not necessary to define a dictionary here since the same result could have been achieved using a simple infile varlist command (see exercises). Here the variable names end on two digits, the first standing for the levels of biofeedback (1: present, 0: absent), and the second for the levels of drug (1, 2, 3 for X, Y, Z). The final dataset should have a single variable, bp, that contains all the blood pressures, and three additional variables, drug, biofeed, and diet, representing the corresponding levels of drug, biofeedback, and diet.

First, create diet which should take on one value for the first six rows and another for the following rows. This is achieved using the commands

generate diet = 0 if _n <= 6
replace diet = 1 if _n > 6

or, more concisely, using

generate diet = (_n > 6)

Now use the reshape long command to stack the columns on top of each other. If we specify bp0 and bp1 as the variable names in the reshape command, then bp01, bp02, and bp03 are stacked into one column with variable name bp0 (and similarly for bp1) and another variable is created that contains the suffixes 1, 2, and 3. We ask for this latter variable to be called drug using the option j(drug) as follows:

Page 96: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

generate id = _n
reshape long bp0 bp1, i(id) j(drug)
list in 1/9

(see Display 4.1). Here, id was generated because we needed to specify the row indicator in the i() option.

        id   drug   bp1   bp0   diet
  1.     1      1   170   173      0
  2.     1      2   186   189      0
  3.     1      3   180   202      0
  4.     2      1   175   194      0
  5.     2      2   194   194      0
  6.     2      3   187   228      0
  7.     3      1   165   197      0
  8.     3      2   201   217      0
  9.     3      3   199   190      0

Display 4.1

We now run the reshape long command again to stack up the columns bp0 and bp1 and generate the variable biofeed. The instructions to achieve this and to label all the variables are given below.

replace id = _n
reshape long bp, i(id) j(biofeed)
replace id = _n

label define d 0 "absent" 1 "present"
label values diet d
label values biofeed d
label define dr 1 "Drug X" 2 "Drug Y" 3 "Drug Z"
label values drug dr
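The effect of the two reshape steps on a single row of bp.raw can be pictured with a small Python sketch. It is only an illustration of the wide-to-long idea, not of how Stata implements reshape; the function name `reshape_long` and the use of dictionaries for records are assumptions made for the example. The input is the first data row quoted above, with diet set to 0 (absent).

```python
def reshape_long(row):
    """Stack the six wide blood pressure columns of one row into long records."""
    records = []
    for name, value in row.items():
        if name.startswith("bp"):
            biofeed = int(name[2])  # first digit: 1 = present, 0 = absent
            drug = int(name[3])     # second digit: 1, 2, 3 for X, Y, Z
            records.append({"diet": row["diet"], "biofeed": biofeed,
                            "drug": drug, "bp": value})
    return records

# First row of the data file, in the variable names used by the dictionary
first_row = {"bp11": 170, "bp12": 186, "bp13": 180,
             "bp01": 173, "bp02": 189, "bp03": 202, "diet": 0}
long_records = reshape_long(first_row)
```

Each of the 12 wide rows yields six such records, giving the 72 observations on bp, drug, biofeed, and diet that the analysis needs.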

To begin, it will be helpful to look at some summary statistics for each of the cells of the design. A simple way of obtaining the required summary measures is to use the table command.

table drug, contents(freq mean bp median bp sd bp) ///
    by(diet biofeed)

Page 97: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

diet, biofeed
and drug         Freq.   mean(bp)   med(bp)     sd(bp)

absent absent
    Drug X           6        188       192   10.86278
    Drug Y           6        200       197   10.07968
    Drug Z           6        209       205    14.3527

absent present
    Drug X           6        168     167.5   8.602325
    Drug Y           6        204       205   12.68069
    Drug Z           6        189     190.5   12.61745

present absent
    Drug X           6        173       172   9.797959
    Drug Y           6        187       188   14.01428
    Drug Z           6        182       179    17.1114

present present
    Drug X           6        169       167   14.81891
    Drug Y           6        172       170   10.93618
    Drug Z           6        173     176.5    11.6619

Display 4.2
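Any single line of Display 4.2 can be checked by hand. As a hedged illustration in Python (outside Stata), the following computes the four summaries for the cell with diet absent, biofeedback absent, and drug X, using the six blood pressures for that cell as read from Table 4.1; the variable names are arbitrary.

```python
import statistics

# Diet absent, no biofeedback, drug X: six subjects (values from Table 4.1)
cell = [173, 194, 197, 190, 176, 198]

freq = len(cell)
mean_bp = statistics.mean(cell)
median_bp = statistics.median(cell)
sd_bp = statistics.stdev(cell)  # sample standard deviation (divisor n - 1)
```

The standard deviation uses the divisor n - 1, which is the quantity the sd statistic of table reports.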

Page 98: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

The standard deviations in Display 4.2 indicate that there are considerable differences in the within cell variability. This may have implications for the analysis of variance of these data: one of the assumptions made is that the observations within each cell have the same population variance. To begin, however, we will fit the model specified in Section 4.2 to the raw data using the anova command.

anova bp drug diet biofeed diet*drug diet*biofeed ///
    drug*biofeed drug*diet*biofeed

The resulting ANOVA table is shown in Display 4.3.

                     Number of obs =      72     R-squared     =  0.5840
                     Root MSE      = 12.5167     Adj R-squared =  0.5077

            Source   Partial SS    df        MS          F     Prob > F

             Model        13194    11   1199.45455     7.66     0.0000

              drug         3675     2       1837.5    11.73     0.0001
              diet         5202     1         5202    33.20     0.0000
           biofeed         2048     1         2048    13.07     0.0006
         diet*drug          903     2        451.5     2.88     0.0638
      diet*biofeed           32     1           32     0.20     0.6529
      drug*biofeed          259     2        129.5     0.83     0.4425
 drug*diet*biofeed         1075     2        537.5     3.43     0.0388

          Residual         9400    60   156.666667

             Total        22594    71   318.225352

Display 4.3
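The entries of Display 4.3 are tied together by simple arithmetic, which the following Python sketch reproduces as a cross-check outside Stata. The sums of squares and degrees of freedom are those read from the display (the drug sum of squares is taken as 2 × 1837.5 = 3675); the dictionary layout and variable names are assumptions of the example.

```python
import math

# (partial sum of squares, degrees of freedom) as read from Display 4.3
effects = {
    "drug": (3675, 2),
    "diet": (5202, 1),
    "biofeed": (2048, 1),
    "diet*drug": (903, 2),
    "diet*biofeed": (32, 1),
    "drug*biofeed": (259, 2),
    "drug*diet*biofeed": (1075, 2),
}
residual_ss, residual_df = 9400, 60

# Each F ratio is the effect mean square divided by the residual mean square
residual_ms = residual_ss / residual_df
f_ratios = {name: (ss / df) / residual_ms for name, (ss, df) in effects.items()}

# In this balanced design the effect sums of squares add up to the model SS
root_mse = math.sqrt(residual_ms)
model_ss = sum(ss for ss, _ in effects.values())
r_squared = model_ss / (model_ss + residual_ss)
```

Because the design is balanced, the seven effect sums of squares add up exactly to the model sum of squares; this additivity is lost in the unbalanced designs of the next chapter.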

The Root MSE is simply the square root of the residual mean square, with R-squared and Adj R-squared being as described in Chapter 3. The F-statistic of each effect represents the mean sum of squares for that effect, divided by the residual mean square, given under the heading MS. There are highly significant main effects of drug ($F_{2,60} = 11.73$, $p < 0.001$), diet ($F_{1,60} = 33.20$, $p < 0.001$), and biofeed ($F_{1,60} = 13.07$, $p < 0.001$). The two-way interactions are not significant at the 5% level but the three-way interaction drug by diet by biofeed is ($F_{2,60} = 3.43$, $p = 0.04$). The existence of a three-way interaction complicates the interpretation of the other terms in the model; it implies that the interaction between any two of the factors is different at the different levels of the third factor. Perhaps the best way of trying to understand the meaning of the three-way interaction is to plot a number of interaction diagrams; that is, plots of mean values for a factor at the different levels of the other factors.

Page 99: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Figure 4.2: Interaction diagrams showing the interaction between diet and biofeedback for each level of drug.

This can be done by first creating a variable predbp containing the predicted means (which in this case coincide with the observed cell means because the model fitted is saturated, i.e., the number of parameters is equal to the number of cell means) using the command

predict predbp

Plots of predbp against biofeed for each level of drug with separate lines for diet can be obtained using the command

twoway (line predbp biofeed if diet==0, sort) ///
    (line predbp biofeed if diet==1, lpat(dash) sort), ///
    by(drug) xlabel(0 "no biofeed." 1 "biofeed.") ///
    ylabel(170 190 210) xtitle(" ") ///
    legend(order(1 "no diet" 2 "diet"))

The resulting interaction diagrams are shown in Figure 4.2. For drug Y, the presence of biofeedback increases the effect of diet (the vertical distance between the solid and dashed lines), whereas for drug Z the effect of diet is hardly altered by the presence of biofeedback and for drug X the effect is decreased.

Page 100: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Tables of the cell means plotted in the interaction diagrams, as well as the corresponding standard deviations, are produced for each drug using the following command:

table diet biofeed, contents(mean bp sd bp) by(drug)

giving the output shown in Display 4.4.

drug and               biofeed
diet              absent     present

Drug X
    absent           188         168
                10.86278    8.602325

    present          173         169
                9.797959    14.81891

Drug Y
    absent           200         204
                10.07968    12.68069

    present          187         172
                14.01428    10.93618

Drug Z
    absent           209         189
                 14.3527    12.61745

    present          182         173
                 17.1114     11.6619

Display 4.4

As mentioned previously, the observations in the 12 cells of the 3 × 2 × 2 design have variances that differ considerably. Consequently, an analysis of variance of the data transformed in some way might be worth considering. For example, to analyze the log transformed observations, we can use the following commands:

generate lbp = log(bp)
anova lbp drug diet biofeed diet*drug diet*biofeed ///
    drug*biofeed drug*diet*biofeed

The resulting analysis of variance table is shown in Display 4.5.

Page 101: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

                     Number of obs =      72     R-squared     =  0.5776
                     Root MSE      = .068013     Adj R-squared =  0.5002

            Source   Partial SS    df        MS          F     Prob > F

             Model   .379534762    11    .03450316     7.46     0.0000

              diet   .149561559     1   .149561559    32.33     0.0000
              drug   .107061236     2   .053530618    11.57     0.0001
           biofeed   .061475507     1   .061475507    13.29     0.0006
         diet*drug   .024011594     2   .012005797     2.60     0.0830
      diet*biofeed   .000657678     1   .000657678     0.14     0.7075
      drug*biofeed   .006467873     2   .003233936     0.70     0.5010
 diet*drug*biofeed   .030299315     2   .015149657     3.28     0.0447

          Residual   .277545987    60   .004625766

             Total   .657080749    71   .009254658

Display 4.5

The results are similar to those for the untransformed blood pressures. The three-way interaction is only marginally significant. If no substantive explanation of this interaction is available, it might be better to interpret the results in terms of the very significant main effects. The relevant summary statistics for the log transformed blood pressures can be obtained using the following instructions:

table drug, contents(mean lbp sd lbp)

table diet, contents(mean lbp sd lbp)

table biofeed, contents(mean lbp sd lbp)

giving the tables in Displays 4.6 to 4.8.

Display 4.6

Page 102: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf


Display 4.7

biofeed     mean(lbp)     sd(lbp)

absent       5.242295    .0890138
present      5.189854    .0953818

Display 4.8

Drug X appears to produce lower blood pressures, as do the special diet and the presence of biofeedback. Readers are encouraged to try other transformations.

Note that it is easy to estimate the model with main effects only using regression with dummy variables. Since drug has three levels and therefore requires two dummy variables, we save some time by using the xi: prefix as follows:

xi: regress lbp i.drug i.diet i.biofeed

leading to the results shown in Display 4.9. The coefficients represent the mean differences between each level compared with the reference level (the omitted categories: drug X, diet absent, and biofeedback absent) when the other variables are held constant. The p-values are equal to those of ANOVA with main effects only, except that no overall p-value for drug is given. This can be obtained using

testparm _Idrug*

The F-statistic is different from that in the last anova command because no interactions were included in the model; hence the residual degrees of freedom and the residual sum of squares have both increased.
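The change in the residual line can be traced by hand. The Python sketch below (an illustration outside Stata) pools the interaction sums of squares and degrees of freedom from the log-scale ANOVA into the residual, which is what omitting those terms amounts to in a balanced design such as this one. The numbers are as read from Display 4.5, and the variable names are arbitrary.

```python
# (sum of squares, degrees of freedom) for the four interaction terms
interactions = [(0.024011594, 2), (0.000657678, 1),
                (0.006467873, 2), (0.030299315, 2)]
residual_ss, residual_df = 0.277545987, 60
drug_ss, drug_df = 0.107061236, 2

# Dropping the interactions moves their sums of squares and degrees of
# freedom into the residual (exact here only because the design is balanced)
pooled_ss = residual_ss + sum(ss for ss, _ in interactions)
pooled_df = residual_df + sum(df for _, df in interactions)

# F ratio for drug with interactions in the model, and with main effects only
f_drug_full = (drug_ss / drug_df) / (residual_ss / residual_df)
f_drug_main = (drug_ss / drug_df) / (pooled_ss / pooled_df)
```

The drug mean square is unchanged, but it is now compared with a larger residual mean square on 67 rather than 60 degrees of freedom, so the F ratio for drug is smaller than in the full model.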

Page 103: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

i.drug        _Idrug_1-3       (naturally coded; _Idrug_1 omitted)
i.diet        _Idiet_0-1       (naturally coded; _Idiet_0 omitted)
i.biofeed     _Ibiofeed_0-1    (naturally coded; _Ibiofeed_0 omitted)

      Source |       SS       df       MS          Number of obs =      72
                                                   F(  4,    67) =   15.72
       Model | .318098302      4  .079524576       Prob > F      =  0.0000
    Residual | .338982447     67   .00505944       R-squared     =  0.4841
                                                   Adj R-squared =  0.4533
       Total | .657080749     71  .009254658       Root MSE      =  .07113

         lbp |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
         ...
       _cons |  5.233949   .0187443   279.23   0.000    5.196535    5.271363

Display 4.9

4.4 Exercises

4.1 Treating hypertension

1. Reproduce the result of the command infile using bp without using the dictionary, and follow the reshape instructions to generate the required dataset.

2. Produce three diagrams with boxplots of blood pressure: (1) for each level of drug, (2) for each level of diet, and (3) for each level of biofeedback.

3. Investigate other possible transformations of the response variable and compare the resulting analyses of variance with those given in the text.

4. Suppose that in addition to the blood pressure of each of the individuals in the study, the investigator had also recorded their ages in the file age.dat as shown in Table 4.2 (but with data on one person per row). Reanalyze the data using age as a covariate (see help merge and help anova).

4.2 Auto pollution filter noise

The data used here are from Lewin and Shakun (1976) and the Data and Story Library (lib.stat.cmu.edu/DASL). They were originally used as part of a statement by Texaco to the Air and Water Pollution Subcommittee of the Senate Public Works Committee on June 26, 1973. Mr. John McKinley, President of Texaco, cited an automobile filter developed by Associated Octel Company as effective in reducing pollution. However, questions had been raised about the effects of filters on vehicle performance, fuel consumption, exhaust gas back pressure, and silencing. On the last question, he referred to the data included here as evidence that the silencing properties of the Octel filter were at least equal to those of standard silencers.

Page 104: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Table 4.2 Data in age.dat

Page 105: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

The variables in filters.dta are:

• noise: noise level (in decibels)
• size: vehicle size (1=small, 2=medium, 3=large)
• type: type of filter (1=standard silencer, 2=Octel filter)
• side: side of car (1=right side, 2=left side)

1. Produce tables of means and standard deviations of noise by type and size.

2. Fit a two-way ANOVA model with noise as response variable and type and size as factors.

3. Produce appropriate interaction diagrams and use them to interpret the results of the two-way ANOVA model.

4. Now fit a three-way ANOVA model with size, type, and side as factors.

5. Produce appropriate interaction diagrams and use them to interpret the results of the three-way ANOVA model.

4.3 Efficiency of cycling

Kapor (1981) investigated the effect of knee-joint angle on the efficiency of cycling. Efficiency was measured in terms of distance pedalled on an ergocycle until exhaustion. The experimenter selected three knee-joint angles of particular interest: 50, 70, and 90 degrees. Thirty subjects were available for the experiment and 10 subjects were randomly allocated to each angle. The drag of the ergocycle was kept constant at 14.7 N, and subjects were instructed to pedal at a constant speed of 20 km/h.

The variables in cycling.dta are:

• id: subject identifier
• group: knee angle group (1=50 degrees, 2=70 degrees, 3=90 degrees)
• distance pedalled (in km)

1. Carry out an initial graphical inspection of the data to assess whether there are any aspects of the observations that might be a cause for concern in later analyses.

Page 106: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

2. Derive the appropriate analysis of variance table for the data.

3. Investigate differences in means between the three angle populations in more detail using a suitable multiple comparison test (Hint: see help oneway).

4.4 Maternal behavior in rats

Here we consider data collected to study the maternal behavior of laboratory rats (Everitt, 2001). The response variable was the time (in seconds) required for the mother to retrieve the pup to the nest, after being moved a fixed distance away. In the study, three independent groups of pups of different ages (5 days, 20 days, and 35 days) were used.

The variables in maternal.dta are:

• mother: rat mother identifier
• age: age of pup (1=5 days, 2=20 days, 3=35 days)
• time: time to retrieve the pup (in seconds)

1. Produce a table of means and standard deviations of time by age.

2. Carry out a one-way analysis of variance of the data.

3. Use an orthogonal polynomial approach to investigate whether there is any evidence of a linear or quadratic trend in the group means. If the model is

$$y_{ij} = \mu + \alpha_i + \epsilon_{ij},$$

the linear contrast is $\alpha_3 - \alpha_1$ and the quadratic contrast is $\alpha_2 - (\alpha_1 + \alpha_3)/2$. Use the anova command, followed by test, showorder to find out the order of the columns in the design matrix. Then define a one-row matrix for each contrast with elements equal to the required contrast coefficients. Use the command test with the mat() option to test the null hypotheses that the contrasts are zero (i.e., no linear trend and no quadratic trend, respectively).
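The two contrasts named in the last exercise can be written as coefficient vectors over $(\alpha_1, \alpha_2, \alpha_3)$. The short Python sketch below (not a solution in Stata) checks that each vector sums to zero and that the two are orthogonal, assuming equal group sizes; the group means are hypothetical and chosen to follow an exactly linear trend.

```python
linear = [-1.0, 0.0, 1.0]       # alpha_3 - alpha_1
quadratic = [-0.5, 1.0, -0.5]   # alpha_2 - (alpha_1 + alpha_3) / 2

def contrast(coefficients, group_means):
    """Value of a contrast: the coefficient-weighted sum of the group means."""
    return sum(c * m for c, m in zip(coefficients, group_means))

# With equal group sizes, two contrasts are orthogonal when the products of
# their coefficients sum to zero
dot = sum(a * b for a, b in zip(linear, quadratic))

hypothetical_means = [20.0, 30.0, 40.0]  # a perfectly linear trend
lin_value = contrast(linear, hypothetical_means)
quad_value = contrast(quadratic, hypothetical_means)
```

For means lying exactly on a straight line the quadratic contrast is zero while the linear contrast is not, which is the pattern the two tests are designed to separate.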

Page 107: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Chapter 5

Analysis of Variance II: Effectiveness of Slimming Clinics

5.1 Description of data

Slimming clinics aim to help people lose weight by offering encouragement and support about dieting through regular meetings. A study was carried out to assess their effectiveness. Half of the clients participating in the study were randomly selected to receive a technical manual containing slimming advice based on psychological behaviorist theory to investigate if this would help them to control their diet. Some of the clients had previously tried to slim whereas others were novices. The data collected are shown in Table 5.1. (They are also given in Hand et al., 1994.) The response variable resp was defined as follows:

$$\frac{\text{weight after three months of treatment} - \text{ideal weight}}{\text{initial weight} - \text{ideal weight}} \qquad (5.1)$$

The design can be thought of as a 2 × 2 factorial design where manual (1: received a manual, 2: did not) is crossed with exper (1: previous slimming experience, 2: novice). The number of observations in each cell of the design is not the same, so this is an example of an unbalanced 2 × 2 design.

Page 108: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Table 5.1 Data in slim.dat

exper   manual     resp       exper   manual     resp
  1       1      -14.67         1       1       -1.85
  1       1       -8.55         1       1      -23.03
  1       1       11.61         1       2        0.81
  1       2        13K          1       2        2.74
  1       2        3.36         1       2        2.10
  1       2       -0.83         1       2       -3.05
  1       2       -5.98         1       2       -3.64
  1       2       -7.33         1       2       -3.60
  1       2       -0.94         2       1       -J.H9
  2       1       -4.110        2       1       -2.21
  2       1       -3.cifl       2       1       -7.60
  2       1      -13.92         2       1       -7.174
  2       1       -?..XI        2       1       -1.62
  2       1      -12.21         2       1       -8.83
  2       2        5.84         2       1        1.71
  2       2       -4.10         2       2       -5.19
  2       2        0.00         2       2       -2.80

5.2 Analysis of variance model

A suitable analysis of variance model for the data is

$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$$

where $y_{ijk}$ represents the weight change of the $k$th individual having experience status $j$ and manual condition $i$, $\mu$ is the overall mean, $\alpha_i$ represents the effect of manual condition $i$, $\beta_j$ the effect of experience status $j$, $\gamma_{ij}$ the experience × manual interaction, and $\epsilon_{ijk}$ the errors. The errors are assumed to have a normal distribution with mean zero and variance $\sigma^2$.

Tha imbalanwd ~laturc of tlic dirntuing data yrcswts amlc difficul- tics for andpis not mr:ountmed in factorid dwigls hiiv~ng thc sanc w1nht.r of ohcw~tions in cwl~ ccll (sec the prcvrous chap~rr). If the data wcm balanoerl, Ihe twnnnrr cclh surn of saimcs would ~arlit.ion ort,l~ugo~~aIIy int,o tiwee colnporicnt sums of qlLrcs sqrwx; t i ng the twc) Insin ~ffrcts nncl their intwactiu11. I-lr>wcver, with unbnlxr~ced data tllere is no uniqne way ool finding a "sum of squaresn i"0rrespontling tu rar:h rnair i cff~c% u ~ d thcir irllcrac%ions, LCCRIIVC thmc efiecls are no longer indepcndm~t of vnc another. Sew14al methods have been propols~d for dealing wit11 this prvblcrn and eardl 1 ~ d s to a diffmcnt prtrtiliw or t.hr w e d l sum uf S~CWPE:. The different met11ods for ax- riving at the sum? of quareb for u~~balax~ccd designs can he cxplnined in terms of the comparjsons OF diffcr~ut sets of specific rnodcln. For a

Page 109: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

lesig11 with two factors A and B. Stata can cal(.nlnt.c: sequcntizl Rums of

, srlnarw ol.l~niqt~e 3111ns o f s q u a r ~ as dweribed in the ncxt suhsoctions.

5.2.1 Sequential sums of squares

Sequential sums of squares (also known as hierarchical sums of squares) represent the effect of adding a term to an existing model. So, for example, a set of sequential sums of squares such as

Source    SS
A         SS(A)
B         SS(B|A)
AB        SS(AB|A,B)

represent a comparison of the following models:

• SS(AB|A,B): model including an interaction and main effects compared with one including only main effects.

• SS(B|A): model including both main effects, but with no interaction, compared with one including only the main effects of factor A.

• SS(A): model containing only the main effect of A compared with one containing only the overall mean.

The use of these sums of squares in a series of tables in which the effects are considered in different orders (see later) will often provide the most satisfactory way of deciding which model is most appropriate for the observations. (These are SAS Type I sums of squares; see Der and Everitt, 2002.)

5.2.2 Unique sums of squares

By default, Stata produces unique sums of squares that represent the contribution of each term to a model including all the other terms. So, for a two-factor design, the sums of squares represent the following:

Source    SS
A         SS(A|B,AB)
B         SS(B|A,AB)
AB        SS(AB|A,B)

(These are SAS Type III sums of squares.) Note that these sums of squares generally do not add up to the model sums of squares.


As we have shown in Chapter 4, ANOVA models may also be estimated using regression by defining suitable dummy variables. Assume that A is represented by a single dummy variable. The regression coefficient for A represents the partial contribution of that variable, adjusted for all other variables in the model, say B. This is equivalent to the contribution of A to a model already including B. A complication with regression models is that, in the presence of an interaction, the p-values of the terms depend on the exact coding of the dummy variables (see Aitkin, 1978). The unique sums of squares correspond to regression where dummy variables are coded in a particular way, for example a two-level factor can be coded as -1, 1.

There have been numerous discussions over which sums of squares are most appropriate for the analysis of unbalanced designs. The Stata manual appears to recommend its default for general use. Nelder (1977) and Aitkin (1978), however, are strongly critical of "correcting" main effects for an interaction term involving the same factor; their criticisms are based on both theoretical and pragmatic arguments and seem compelling. A frequently used approach is therefore to test the highest order interaction adjusting for all lower order interactions and not vice versa. Both Nelder and Aitkin prefer the use of Type I sums of squares in association with different orders of effects as the procedure most likely to identify an appropriate model for a data set. For a detailed explanation of the various types of sums of squares, see Boniface (1995).

5.3 Analysis using Stata

The data can be read in from an ASCII file slim.dat in the usual way using

infile manual exper resp using slim.dat

A table showing the unbalanced nature of the 2 x 2 design can be obtained from

tabulate manual exper

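           |        exper
    manual |      1       2 |   Total
-----------+----------------+--------
         1 |      5      12 |      17
         2 |     11       6 |      17
-----------+----------------+--------
     Total |     16      18 |      34

We now use the anova command with no options specified to obtain the unique (Type III) sums of squares: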


anova resp manual exper manual*exper

(see Display 5.1).

    Number of obs =      34     R-squared     =  0.2103
    Root MSE      =  5.9968     Adj R-squared =  0.1313

          Source |  Partial SS    df        MS          F     Prob > F
    -------------+----------------------------------------------------
           Model |  287.231861     3   95.7439537     2.66     0.0659
                 |
          manual |  2.19850409     1   2.19850409     0.06     0.8064
           exper |  265.871053     1   265.871053     7.39     0.0108
    manual*exper |  .130318264     1   .130318264     0.00     0.9524
                 |
        Residual |  1078.84812    30    35.961604
    -------------+----------------------------------------------------
           Total |  1366.07998    33   41.3963631

Display 5.1

Our recommendation is that the sums of squares shown in this table are not used to draw inferences because the main effects have been adjusted for the interaction.

Instead we prefer an analysis that consists of obtaining two sets of sequential sums of squares, the first using the order manual exper manual*exper and the second the order exper manual manual*exper. The necessary instructions are

anova resp manual exper manual*exper, sequential

(see Display 5.2).

anova resp exper manual manual*exper, sequential

(see Display 5.3).

    Number of obs =      34     R-squared     =  0.2103
    Root MSE      =  5.9968     Adj R-squared =  0.1313

          Source |     Seq. SS    df        MS          F     Prob > F
    -------------+----------------------------------------------------
           Model |  287.231861     3   95.7439537     2.66     0.0659
                 |  (rows for manual, exper, and manual*exper are not legible in the source)
        Residual |  1078.84812    30    35.961604
    -------------+----------------------------------------------------
           Total |  1366.07998    33   41.3963631

Display 5.2

    Number of obs =      34     R-squared     =  0.2103
    Root MSE      =  5.9968     Adj R-squared =  0.1313

          Source |     Seq. SS    df        MS          F     Prob > F
    -------------+----------------------------------------------------
           Model |  287.231861     3   95.7439537     2.66     0.0659
                 |  (rows for exper, manual, and manual*exper are not legible in the source)
        Residual |  1078.84812    30    35.961604
    -------------+----------------------------------------------------
           Total |  1366.07998    33   41.3963631

Display 5.3

The sums of squares corresponding to model and residuals are, of course, the same in both tables, as is the sum of squares for the interaction term. What differ are the sums of squares in the manual and exper rows in the two tables. The terms of most interest are the sum of squares of exper|manual, which is obtained from the table as 265.91, and the sum of squares of manual|exper, which is 2.13. These sums of squares are less than the sums of squares for exper and manual alone (284.97 and 21.19, respectively) by an amount of 19.06, a portion of the model sums of squares which cannot be uniquely attributed to either of the variables. The associated F-tests in the two tables make it clear that there is no interaction effect and that exper|manual is significant but manual|exper is not. The conclusion is that only exper, i.e., whether the women had been slimming for over one year, is important in determining weight change. Provision of the manual appears to have no discernible effect. Figure 5.1 illustrates the two ways of partitioning the model sums of squares into components due to exper (large circle) and manual (small circle), depending on the order in which the main effects are entered.
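The two orderings can also be checked by hand. The sketch below is written in Python rather than Stata and is only an illustration: it assumes the 34 rows of slim.dat in the column order manual, exper, resp used by the infile command, fits each nested model by ordinary least squares with 0/1 indicator columns, and reports the drop in the residual sum of squares as each term is added. The helper names (solve, rss, sequential_ss) are invented for this example and are not part of Stata.

```python
# Rows of slim.dat as (manual, exper, resp); column order assumed from the infile command.
DATA = [
    (1, 1, -14.67), (1, 1, -1.85), (1, 1, -8.55), (1, 1, -23.03), (1, 1, 11.61),
    (1, 2, 0.81), (1, 2, 2.38), (1, 2, 2.74), (1, 2, 3.36), (1, 2, 2.10),
    (1, 2, -0.83), (1, 2, -3.05), (1, 2, -5.98), (1, 2, -3.64), (1, 2, -7.38),
    (1, 2, -3.60), (1, 2, -0.94),
    (2, 1, -3.39), (2, 1, -4.00), (2, 1, -2.31), (2, 1, -3.60), (2, 1, -7.69),
    (2, 1, -13.92), (2, 1, -7.64), (2, 1, -7.59), (2, 1, -1.62), (2, 1, -12.21),
    (2, 1, -8.85),
    (2, 2, 5.84), (2, 2, 1.71), (2, 2, -4.10), (2, 2, -5.19), (2, 2, 0.00),
    (2, 2, -2.80),
]


def manual_term(manual, exper):
    """Indicator for manual == 1."""
    return float(manual == 1)


def exper_term(manual, exper):
    """Indicator for exper == 1."""
    return float(exper == 1)


def interaction_term(manual, exper):
    """Indicator for manual == 1 and exper == 1."""
    return float(manual == 1 and exper == 1)


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(terms):
    """Residual sum of squares for resp regressed on a constant plus the given terms."""
    y = [resp for _, _, resp in DATA]
    x = [[1.0] + [term(man, exp) for term in terms] for man, exp, _ in DATA]
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in x]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def sequential_ss(order):
    """Drop in residual SS as each term in `order` is added to the model in turn."""
    drops, included = [], []
    previous = rss(included)
    for term in order:
        included.append(term)
        current = rss(included)
        drops.append(previous - current)
        previous = current
    return drops


ss_m, ss_e_given_m, ss_int = sequential_ss([manual_term, exper_term, interaction_term])
ss_e, ss_m_given_e, _ = sequential_ss([exper_term, manual_term, interaction_term])
print("manual first:", round(ss_m, 2), round(ss_e_given_m, 2), round(ss_int, 2))
print("exper first: ", round(ss_e, 2), round(ss_m_given_e, 2))
```

Under these assumptions the two orderings should agree with the values quoted in the text: about 21.19 and 265.91 when manual enters first, and 284.97 and 2.13 when exper enters first, with the interaction contributing about 0.13 in both.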

Results equivalent to the unique (Type III) sums of squares can be obtained using regression:

generate manual1 = manual
recode manual1 1=-1 2=1
generate exper1 = exper
recode exper1 1=-1 2=1
generate exp_man = manual1*exper1
regress resp manual1 exper1 exp_man

Figure 5.1: Venn diagrams showing sequential sums of squares for weight loss studies. The large circle represents exper and the small circle manual. In the left panel manual enters the model first, whereas in the right panel exper enters the model first.

(see Display 5.4).

          Source |       SS       df       MS         Number of obs =      34
    -------------+------------------------------      F(  3,    30) =    2.66
           Model |  287.231861     3  95.7439537      Prob > F      =  0.0659
        Residual |  1078.84812    30   35.961604      R-squared     =  0.2103
    -------------+------------------------------      Adj R-squared =  0.1313
           Total |  1366.07998    33  41.3963631      Root MSE      =  5.9968

            resp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         manual1 |   .2726251   1.102609     0.25   0.806    -1.979204    2.524454
          exper1 |   2.998042   1.102609     2.72   0.011      .746213     5.24987
         exp_man |   -.066375   1.102609    -0.06   0.952    -2.318204    2.185454
           _cons |  -3.960958   1.102609    -3.59   0.001    -6.212787    -1.70913

Display 5.4

The p-values agree with those based on unique sums of squares. However, these results differ from the regression used by Stata's anova with the option regress:

anova resp manual exper manual*exper, regress

(see Display 5.5) because this uses different dummy variables, coded as 1 for the levels shown on the left of the reported coefficient and 0 otherwise, i.e., the dummy variable for manual*exper is 1 when exper and manual are both 1.

          Source |       SS       df       MS         Number of obs =      34
    -------------+------------------------------      F(  3,    30) =    2.66
           Model |  287.231861     3  95.7439537      Prob > F      =  0.0659
        Residual |  1078.84812    30   35.961604      R-squared     =  0.2103
    -------------+------------------------------      Adj R-squared =  0.1313
           Total |  1366.07998    33  41.3963631      Root MSE      =  5.9968

            resp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           _cons |  -.7566666   2.448183    -0.31   0.759    -5.756524     4.24319
          manual |
               1 |  -.4125001     2.9984    -0.14   0.891    -6.536049    5.711049
               2 |  (dropped)
           exper |
               1 |  -5.863333   3.043491    -1.93   0.064    -12.07897    .3523044
               2 |  (dropped)
    manual*exper |
             1 1 |  -.2655002   4.410437    -0.06   0.952    -9.272815    8.741815
             1 2 |  (dropped)
             2 1 |  (dropped)
             2 2 |  (dropped)

Display 5.5
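Because the model with both main effects and the interaction is saturated, the coefficients under either coding are simple functions of the four cell means. The following Python sketch illustrates this; it is not Stata code. It assumes cell totals computed from Table 5.1 (keyed by manual and exper), and the function names effect_coded and indicator_coded are made up for the illustration.

```python
# Cell means of resp keyed by (manual, exper): cell total divided by cell size.
CELL_MEANS = {
    (1, 1): -36.49 / 5,
    (1, 2): -14.03 / 12,
    (2, 1): -72.82 / 11,
    (2, 2): -4.54 / 6,
}


def effect_coded(means):
    """Coefficients when each factor is coded -1 for level 1 and 1 for level 2."""
    m11, m12, m21, m22 = (means[key] for key in [(1, 1), (1, 2), (2, 1), (2, 2)])
    return {
        "_cons": (m11 + m12 + m21 + m22) / 4,
        "manual1": (m21 + m22 - m11 - m12) / 4,
        "exper1": (m12 + m22 - m11 - m21) / 4,
        "exp_man": (m11 - m12 - m21 + m22) / 4,
    }


def indicator_coded(means):
    """Coefficients when dummies are 1 for level 1 and 0 otherwise (level 2 is the reference)."""
    m11, m12, m21, m22 = (means[key] for key in [(1, 1), (1, 2), (2, 1), (2, 2)])
    return {
        "_cons": m22,
        "manual 1": m12 - m22,
        "exper 1": m21 - m22,
        "manual*exper 1 1": m11 - m12 - m21 + m22,
    }


for name, coefficients in (("-1/1 coding", effect_coded(CELL_MEANS)),
                           ("0/1 coding", indicator_coded(CELL_MEANS))):
    print(name)
    for term, value in coefficients.items():
        print(f"  {term:18s} {value: .6f}")
```

With the -1/1 coding every coefficient is an unweighted contrast of all four cell means, which is why its tests match the unique sums of squares. With the 0/1 coding each main-effect coefficient compares two cells within the reference level of the other factor, so it answers a different question.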

A table of mean values helpful in interpreting these results can be found using

table manual exper, contents(mean resp) row col f(%8.2f)

The means demonstrate that experienced slimmers achieve the greatest weight reduction.

5.4 Exercises

5.1 Effectiveness of slimming clinics

1. Investigate what happens to the sequential sums of squares if the manual*exper interaction term is given before the main effects manual and exper in the anova command with the sequential option.

2. Use regress to reproduce the analysis of variance by coding both manual and exper as (0,1) dummy variables and creating an interaction variable as the product of these dummy variables.

3. Use regress in conjunction with xi: to fit the same model without the need to generate any dummy variables.

4. Reproduce the results of anova resp manual exper manual*exper, regress using regress by making xi: omit the last category instead of the first (see help xi, under "Summary of controlling the omitted dummy").

See also the Exercises in Chapters 7 and 13.

5.2 Systolic blood pressure

Boniface (1995) provides data from Maxwell and Delaney (1990) on systolic blood pressure of individuals, classified according to their smoking status and family history of circulation and heart problems.

The variables in the dataset systolic.dta are:

• history: family history (0=no, 1=yes)
• smoking: smoking status (1=non-smoker, 2=ex-smoker, 3=current smoker)
• systolic: systolic blood pressure

1. Carry out an analysis of variance of the data, retaining the interaction only if it is significant at the 5% level.

2. Produce an appropriate graph for interpreting the analysis of variance results and state your conclusions.

3. Examine the residuals from fitting what you consider the most suitable model for the data, and use various plots to assess the assumptions of the analyses you have performed (see help anova postestimation).

5.3 Roletaking in young children

Ktcmch~trk et al. (1990) studied role-taking in children. In their study, children between the ages of 2 and 5 years were administered a battery of roletaking tasks. Participants were classified into a group who had had no previous daycare experience and a group who had extensive daycare experience. Children were also sorted into two age groups (2 to 3 years and 4 to 5 years). The investigators' hypothesis was that children with daycare experience would perform better on role-taking tasks than would children without daycare experience because of the former group's greater opportunity for social development. There was also interest in the effect of age and the possible interaction of age with daycare group. The dependent variable was a roletaking score with higher values representing better performance.

The variables in role.dta are:

• id: child identifier
• daycare: dummy variable for having had previous experience of daycare
• age: age group (0=2 to 3 years, 1=4 to 5 years)
• role: score for performance on role-playing tasks

1. Generate a variable equal to the mean of role for each child's age and daycare group and produce line graphs of these means versus age with separate lines for the daycare and no daycare groups.

2. Fit a two-way ANOVA model to the data. Simplify the model as much as possible using a 5% significance level.

3. For the chosen model, plot a line graph of the model-implied means (analogous to question 1). Also add the corresponding sample means represented by dots.

Chapter 6

Logistic Regression: Treatment of Lung Cancer and Diagnosis of Heart Attacks

6.1 Description of data

Two datasets will be analyzed in this chapter. The first dataset shown in Table 6.1 originates from a clinical trial in which lung cancer patients were randomized to receive two different kinds of chemotherapy (sequential therapy and alternating therapy). The outcome was classified into one of four categories: progressive disease, no change, partial remission, or complete remission. The data were published in Holtbrügge and Schumacher (1991) and also appear in Hand et al. (1994). The central question is whether there is any evidence of a difference in the outcomes achieved by the two types of therapy.

Table 6.1 Lung cancer data in tumor.dat

                         Progressive   No        Partial     Complete
Therapy       Sex        disease       change    remission   remission
Sequential    Male           28          45         29          26
              Female          4          12          5           2
Alternating   Male           41          44         20          20
              Female         12           7          3           1

Table 6.2 Data in sck.dat

The second dataset to be used in this chapter arises from a study investigating the use of serum creatine kinase (CK) levels for the diagnosis of myocardial infarction (heart attack). Patients admitted to a coronary care unit because they were suspected of having had a myocardial infarction within the last 48 hours had their CK levels measured on admission and the next two mornings. A clinician who was "blind" to the CK results came to an independent "gold standard" diagnosis using electrocardiograms, clinical records, and autopsy reports. The maximum CK levels for 360 patients are given in Table 6.2 together with the clinician's diagnosis. The table was taken from Sackett et al. (1991) (with permission of the publisher, Little, Brown & Company), where only the ranges of CK levels were given, not their precise values.

The main questions of interest for this second dataset are how well CK discriminates between those with and without myocardial infarction, and how diagnostic tests perform that are based on applying different thresholds to CK.

6.2 The logistic regression model

6.2.1 Binary responses

Dichotomous or binary responses arise when the outcome is presence or absence of a characteristic or event, for example myocardial infarction in the second dataset. What we would like to do is to investigate the effects of a number of explanatory variables on our binary response variable. This appears to be the same aim as for multiple regression discussed in Chapter 3, where the model for a response $y_i$ and explanatory variables $x_{1i}$ to $x_{pi}$ can be written as

$$E(y_i \mid x_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}. \qquad (6.1)$$

Binary responses are typically coded 1 for the event of interest, such as infarct present, and 0 for the opposite event. In this case the expected value is simply the probability $\pi_i$ that the event of interest occurs. This raises the first problem for applying the model above to a binary response variable, namely that

1. the predicted probability must satisfy $0 \le \pi_i \le 1$ whereas the linear model above can yield any value from minus infinity to plus infinity.

A second problem with using linear regression is that

2. the observed values of $y_i$ do not follow a normal distribution with mean $\pi_i$, but rather a Bernoulli (or Binomial(1, $\pi_i$)) distribution.

Consequently a new approach is needed, and this is provided by logistic regression.

In logistic regression, the first problem is addressed by replacing the probability $\pi_i = E(y_i \mid x_i)$ on the left-hand side of equation (6.1) by the logit of the probability, giving

$$\mathrm{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}.$$

The logit of the probability is simply the log of the odds of the event of interest. Writing $\beta$ and $x_i$ for the column vectors $(\beta_0, \beta_1, \ldots, \beta_p)'$ and $(1, x_{1i}, \ldots, x_{pi})'$, respectively, the probability as a function of the covariates is

$$\pi_i = \frac{\exp(\beta' x_i)}{1 + \exp(\beta' x_i)}.$$

When the logit takes on any real value, this probability always satisfies $0 \le \pi_i \le 1$. This is illustrated for a single covariate in Figure 6.1 where the logistic function is shown along with a linear function. For $\pi$ between 0.2 and 0.8, both functions are similar, but as $\pi$ approaches 0 and 1, the logistic function curve flattens, producing an "S" shape.

Figure 6.1: Linear and logistic functions of x.

The second problem relates to the estimation procedure. Whereas maximum likelihood estimation in conventional linear regression leads to least squares, this is not the case in logistic regression. In logistic regression the log likelihood is maximized numerically using an iterative algorithm. For full details of logistic regression, see for example Collett (2002), Agresti (1996), and Long and Freese (2006). (The last reference provides a comprehensive discussion of regression models for categorical variables using Stata.)
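The way the logit link keeps predicted probabilities inside the unit interval can be seen numerically. The short Python sketch below is an illustration only; the function names inverse_logit and logit are chosen for this example, and the linear predictor values are arbitrary.

```python
import math


def inverse_logit(eta):
    """Map a linear predictor (log odds) to a probability strictly between 0 and 1."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    # For negative eta use the equivalent form to avoid overflow in exp().
    return math.exp(eta) / (1.0 + math.exp(eta))


def logit(p):
    """Log odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


for eta in (-6.0, -2.0, -1.0, 0.0, 1.0, 2.0, 6.0):
    print(f"log odds {eta:5.1f}  ->  probability {inverse_logit(eta):.4f}")
```

A linear predictor can take any real value, yet the implied probability only approaches 0 or 1 without reaching them, which is the flattening at the ends of the "S" shape described above.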

Logistic regression can be generalized to the situation where the response variable has more than two ordered response categories. In the latent response formulation, we think of the ordered categories as representing successive intervals of an underlying latent (unobserved) continuous response. If there are S response categories labeled $a_1, \ldots, a_S$, the relationship between the observed and latent response can be formulated as a threshold model:

$$y_i = \begin{cases} a_1 & \text{if } y_i^* \le \kappa_1 \\ a_2 & \text{if } \kappa_1 < y_i^* \le \kappa_2 \\ \;\vdots \\ a_S & \text{if } \kappa_{S-1} < y_i^*, \end{cases}$$

where $\kappa_s$, $s = 1, \ldots, S-1$ are threshold or cutpoint parameters. The latent response is modeled as a linear regression

$$y_i^* = x_i'\beta + \epsilon_i,$$

where $\epsilon_i$ has a logistic distribution,

$$\Pr(\epsilon_i < X) = \frac{\exp(X)}{1 + \exp(X)}.$$

The latent response and threshold model imply a logistic model for the cumulative probabilities. The cumulative probability $\gamma_{is}$ that the response $y_i$ takes on a value greater than $a_s$ becomes

$$\gamma_{is} \equiv \Pr(y_i > a_s) = \Pr(y_i^* > \kappa_s) = \Pr(y_i^* - x_i'\beta > \kappa_s - x_i'\beta) = \frac{\exp(x_i'\beta - \kappa_s)}{1 + \exp(x_i'\beta - \kappa_s)}. \qquad (6.4)$$

It is clear from this expression that we could not simultaneously estimate the constant $\beta_0$ in the model for $y_i^*$ and all threshold parameters since we could increase the constant and all thresholds by the same amount without changing the model. In Stata the constant is therefore set to zero for identification.

The model is also called the proportional odds model because the log odds that $y_i > a_s$ are

$$\log\left(\frac{\gamma_{is}}{1 - \gamma_{is}}\right) = x_i'\beta - \kappa_s$$

so that the log odds ratio for two units i and j is $(x_i - x_j)'\beta$, which is independent of s. Therefore, $\exp(\beta_k)$ represents the odds ratio that $y > a_s$ for any s when $x_k$ increases by one unit if all other covariates remain the same.

In binary logistic regression for dichotomous responses, $a_1 = 0$, $a_2 = 1$, $\kappa_1 = 0$, and $\exp(\beta_k)$ is the odds ratio that $y = 1$ when $x_k$ increases by one unit and all other covariates remain the same. Note that a different identifying restriction is used than for ordinal responses: the threshold $\kappa_1$ is set to zero instead of the constant in the model for $y_i^*$.

The probit and ordinal probit models correspond to logistic and ordinal logistic regression models with the cumulative distribution function in equation (6.4) replaced by the standard normal cumulative distribution function. More information on models for ordinal data can be found in Agresti (1996) and Long and Freese (2006).


6.3 Analysis using Stata

The ASCII file tumor.dat contains the four by four matrix of frequencies shown in Table 6.1. First we read the data and generate variables for therapy and sex using the egen function seq():

infile fr1 fr2 fr3 fr4 using tumor.dat
egen therapy = seq(), from(0) to(1) block(2)
egen sex = seq(), from(1) to(2) by(therapy)
label define t 0 seq 1 alt
label values therapy t
label define s 1 male 2 female
label values sex s

block(2) causes the number in the sequence (from 0 to 1) to be repeated in blocks of two, whereas by(therapy) causes the sequence to start from the lower limit every time the value of therapy changes.

We next reshape the data to long, placing the four levels of the outcome, represented by fr1 to fr4, into a variable outc,

reshape long fr, i(therapy sex) j(outc)

and expand the dataset by replicating each observation fr times so that we have one observation per subject:

expand fr

We can check that the data conversion is correct by tabulating these data as in Table 6.1:

table sex outc, contents(freq) by(therapy)

giving the table in Display 6.1.

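    therapy   |            outc
    and sex   |     1      2      3      4
    ----------+---------------------------
    seq       |
       male   |    28     45     29     26
       female |     4     12      5      2
    ----------+---------------------------
    alt       |
       male   |    41     44     20     20
       female |    12      7      3      1

Display 6.1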
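The reshape and expand steps amount to turning a table of counts into one record per patient. The Python sketch below mimics that idea outside Stata, purely as an illustration. It assumes the frequencies of Table 6.1, and the variable names (FREQ, long_rows, subjects) are invented for this example.

```python
# Frequencies from Table 6.1: (therapy, sex) -> counts for outcome categories 1 to 4.
FREQ = {
    ("seq", "male"): [28, 45, 29, 26],
    ("seq", "female"): [4, 12, 5, 2],
    ("alt", "male"): [41, 44, 20, 20],
    ("alt", "female"): [12, 7, 3, 1],
}

# Analogue of "reshape long": one row per therapy, sex, and outcome with its frequency.
long_rows = [
    (therapy, sex, outc, fr)
    for (therapy, sex), counts in FREQ.items()
    for outc, fr in enumerate(counts, start=1)
]

# Analogue of "expand fr": repeat each row fr times so that each patient is one record.
subjects = [
    (therapy, sex, outc)
    for therapy, sex, outc, fr in long_rows
    for _ in range(fr)
]

# Improvement is defined as partial or complete remission (outcome 3 or 4).
improve = [int(outc > 2) for _, _, outc in subjects]

print("rows after reshape:", len(long_rows))
print("patients after expand:", len(subjects))
print("patients with improvement:", sum(improve))
```

The patient count gives a quick check that no frequency was lost or duplicated in the conversion, which is the same purpose the tabulation in Display 6.1 serves.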

To use ordinary (binary) logistic regression, we must dichotomize the outcome, for example, by considering partial and complete remission to be an improvement and the other categories to be no improvement. The new outcome variable can be generated as follows:

recode outc 1/2=0 3/4=1, generate(improve)

or using

generate improve = outc>2

The command logit for logistic regression behaves the same way as regress and all other estimation commands. For example, automatic selection procedures can be carried out using the stepwise prefix and post-estimation commands such as testparm and predict are available. First, include therapy as the only explanatory variable:

logit improve therapy

(see Display 6.2). The algorithm takes two iterations to converge.

    Iteration 0:   log likelihood = -194.40888
    Iteration 1:   log likelihood = -192.30753
    Iteration 2:   log likelihood = -192.30471

    Logistic regression                      Number of obs   =        299
                                             LR chi2(1)      =       4.21
                                             Prob > chi2     =     0.0402
    Log likelihood = -192.30471              Pseudo R2       =     0.0108

         improve |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         therapy |  -.4986993   .2443508    -2.04   0.041     -.977618   -.0197805
           _cons |   -.361502   .1654236    -2.19   0.029    -.6857263   -.0372777

Display 6.2

The coefficient of therapy represents the difference in the log odds (of an improvement) between the alternating and sequential therapies. The negative value indicates that sequential therapy is superior to alternating therapy. The p-value of the coefficient is 0.041 in the table. This was derived from the z statistic, which is given by the coefficient divided by its asymptotic standard error (Std. Err.). Under the null hypothesis that the true coefficient is zero, the statistic has a standard normal distribution, and its square, the Wald statistic, a $\chi^2$-distribution with one degree of freedom. This p-value from the Wald test is less reliable than the p-value based on the likelihood ratio between the model including only the constant and the current model. Under the null hypothesis that the constant-only model is correct, minus twice the log likelihood ratio has an approximate $\chi^2$-distribution with one degree of freedom (because there is one additional parameter in the competing model). Here the likelihood ratio statistic is equal to 4.21, giving a p-value of 0.0402, very similar to that based on the Wald test.
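Both tests can be reproduced from numbers in Display 6.2. The Python sketch below is an illustration only: it takes the therapy coefficient, its standard error, and the log likelihoods at iteration 0 (constant-only model) and at convergence as given, and uses the complementary error function for the normal and one-degree-of-freedom chi-squared tail areas. The function names are invented for this example.

```python
import math


def normal_two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def chi2_1df_p(x):
    """Upper-tail p-value for a chi-squared statistic with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


coef, se = -0.4986993, 0.2443508              # therapy row of Display 6.2
ll_null, ll_model = -194.40888, -192.30471    # iteration 0 and final log likelihood

z = coef / se
wald_p = normal_two_sided_p(z)
lr = 2.0 * (ll_model - ll_null)               # minus twice the log likelihood ratio
lr_p = chi2_1df_p(lr)

print(f"Wald z = {z:.2f}, p = {wald_p:.3f}")
print(f"LR chi2(1) = {lr:.2f}, p = {lr_p:.4f}")
```

The shortcut through erfc works here only because the comparison involves a single extra parameter; a test with more degrees of freedom would need a general chi-squared tail function.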

The coefficient of therapy represents the difference in log odds between the therapies and is not easy to interpret apart from the sign. Exponentiating the coefficient gives the odds ratio and exponentiating the 95% confidence limits gives the confidence interval for the odds ratio. Fortunately, the or option can be used to obtain the required odds ratio and its confidence interval directly (alternatively, we could use the logistic command):

logit improve therapy, or

(see Display 6.3).

    Logistic regression
    Log likelihood = -192.30471

         improve | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         therapy |   .6073201   .1483991    -2.04   0.041     .3762061    .9804138

Display 6.3

The standard error now represents the approximate standard error of the odds ratio (calculated using the delta method, see, e.g., Agresti, 2002). Since the sampling distribution of the odds ratio is not well approximated by a normal distribution, the Wald statistic and confidence interval are derived using the log odds and its standard error. Alternating therapy is associated with a 100(1 - 0.6073201)% = 39% reduction in the odds of an improvement compared with sequential therapy (95% confidence interval from 2% to 62%).
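The odds ratio, its confidence interval, and its delta-method standard error all follow from the log odds coefficient and standard error. The Python sketch below shows the arithmetic as an illustration, assuming the therapy estimates from Display 6.2 and the usual 1.959964 normal critical value for a 95% interval; the variable names are chosen for this example.

```python
import math

coef, se = -0.4986993, 0.2443508   # log odds ratio for therapy and its standard error
z_crit = 1.959964                  # 97.5th percentile of the standard normal distribution

odds_ratio = math.exp(coef)

# The interval is built on the log odds scale and then exponentiated.
ci_lower = math.exp(coef - z_crit * se)
ci_upper = math.exp(coef + z_crit * se)

# Delta method: the standard error of exp(b) is approximately exp(b) times se(b).
se_odds_ratio = odds_ratio * se

print(f"odds ratio = {odds_ratio:.7f}")
print(f"95% CI = ({ci_lower:.7f}, {ci_upper:.7f})")
print(f"approximate standard error = {se_odds_ratio:.7f}")
print(f"reduction in odds = {100 * (1 - odds_ratio):.0f}%")
print(f"interval for the reduction = {100 * (1 - ci_upper):.0f}% to {100 * (1 - ci_lower):.0f}%")
```

Note that the interval is not symmetric about the odds ratio, which is why it is not computed as the odds ratio plus or minus a multiple of the delta-method standard error.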

To test whether the inclusion of sex in the model significantly increases the likelihood, the current likelihood (and all the estimates) can be saved using

estimates store model1

Including sex

logit improve therapy sex, or

gives the output shown in Display 6.4.

    Logistic regression                      Number of obs   =        299
                                             LR chi2(2)      =       7.55
                                             Prob > chi2     =     0.0229
    Log likelihood = -190.63171              Pseudo R2       =     0.0194

         improve | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         therapy |   .6051969   .1486907    -2.04   0.041     .3739084    .9795537
             sex |   .5197993   .1930918    -1.76   0.078     .2509785    1.076551

Display 6.4

The p-value of sex based on the Wald statistic is 0.078, and a p-value for the likelihood-ratio test is obtained using

lrtest model1 .

    Likelihood-ratio test                    LR chi2(1)  =      3.35
    (Assumption: model1 nested in .)         Prob > chi2 =    0.0674

which is not very different from the value of 0.078. In the lrtest command "." refers to the current model and model1 is the model excluding sex which was previously stored using estimates store. We could have specified the models in the reverse order as Stata assumes that the model with the lower log likelihood is nested within the other model. Note that it is essential that both models compared in the likelihood ratio test be based on the same sample. If sex had missing values, fewer observations would contribute to the model including sex than to the nested model excluding sex. In this case, we would have to restrict estimation of the nested model to the "estimation sample" of the full model. If the full model has been estimated first, this can be achieved using logistic improve therapy if e(sample).

Retaining the variable sex in the model (although it is not significant at the 5% level), the predicted probabilities can be obtained using predict with the pr option

predict prob, pr

and the four different predicted probabilities may be compared with the observed proportions as follows:

table sex, contents(mean prob mean improve freq) ///
    by(therapy)

(see Display 6.5).

    therapy   |
    and sex   |  mean(prob)   mean(improve)     Freq.
    ----------+--------------------------------------
    seq       |
       male   |    .4332747        .4296875       128
       female |    .2843846        .3043478        23
    ----------+--------------------------------------
    alt       |
       male   |    .3163268             .32       125
       female |    .1938763         .173913        23

Display 6.5

The agreement is good, so there appears to be no strong interaction between sex and type of therapy. (We could test for an interaction between sex and therapy by using xi: logistic improve i.therapy*i.sex.) A more formal assessment of the goodness of fit can be obtained using the command

estat gof, table

which produces the output shown in Display 6.6. There are four unique covariate patterns given by the combinations of therapy and sex. The individuals sharing a given covariate pattern are referred to as a group; the last two columns show the corresponding covariate values and the second column gives the predicted probabilities. The observed number of cases in each group is given under Total and this is composed of Obs_1 individuals with a response improve equal to 1 and Obs_0 individuals with a response equal to 0. The expected number of individuals with a 1 response, Exp_1, is obtained by multiplying the total number of individuals in each group by the predicted probability for the group. Finally, the expected number of individuals with a 0 response is just Total - Exp_1. At the bottom of the output, observed and expected frequencies are compared using a Pearson chi-square test. The null hypothesis that the model is correct cannot be rejected here, with a p-value of 0.73. Note that the test cannot detect if there are important omitted covariates because the table of observed and expected frequencies is aggregated over any covariates not included in the model.

    Logistic model for improve, goodness-of-fit test

    (group table not legible in the source)

           number of observations =       299
     number of covariate patterns =         4
                  Pearson chi2(1) =      0.12
                      Prob > chi2 =    0.7310

Display 6.6
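The Pearson statistic can be rebuilt from the four covariate patterns. The Python sketch below is an illustration, not the estat gof implementation. It assumes the predicted probabilities and group sizes shown in Display 6.5 and the observed numbers of improvements implied by Table 6.1 (55, 7, 40, and 4), and it uses four patterns minus three estimated parameters for the degrees of freedom.

```python
import math

# Covariate patterns: (therapy, sex, predicted probability, observed improvements, total).
PATTERNS = [
    ("seq", "male", 0.4332747, 55, 128),
    ("seq", "female", 0.2843846, 7, 23),
    ("alt", "male", 0.3163268, 40, 125),
    ("alt", "female", 0.1938763, 4, 23),
]


def pearson_chi2(patterns):
    """Sum of (observed - expected)^2 / expected over both response values in every group."""
    total = 0.0
    for _, _, prob, obs_1, n in patterns:
        exp_1 = n * prob          # expected number with response 1
        exp_0 = n - exp_1         # expected number with response 0
        obs_0 = n - obs_1
        total += (obs_1 - exp_1) ** 2 / exp_1 + (obs_0 - exp_0) ** 2 / exp_0
    return total


chi2 = pearson_chi2(PATTERNS)
df = len(PATTERNS) - 3                        # constant, therapy, and sex were estimated
p_value = math.erfc(math.sqrt(chi2 / 2.0))    # tail area valid for one degree of freedom only

print(f"Pearson chi2({df}) = {chi2:.2f}, p = {p_value:.4f}")
```

Because the expected counts come from fitted probabilities, the expected improvements within each therapy group and within each sex add up to the observed totals, which is a useful check that the inputs were entered correctly.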

We now fit the proportional odds model using the full ordinal response variable outc:

ologit outc therapy sex, or

The results are shown in Display 6.7.

    Ordered logistic regression              Number of obs   =        299
                                             LR chi2(2)      =      10.91
                                             Prob > chi2     =     0.0043
    Log likelihood = -394.52832              Pseudo R2       =     0.0136

    (coefficient table not legible in the source)

Display 6.7

The odds ratios represent the estimated effects of therapy and sex on the odds of being in "complete remission" (category 4) versus being at best in "partial remission" (categories 1 to 3); or of being at least in "partial remission" (categories 3 and 4) rather than having "progressive disease" or "no change" (categories 1 and 2); or of "no change" or better (categories 2 to 4) versus "progressive disease" (category 1). The second interpretation corresponds to the dichotomization used to fit the binary logistic regression model. Therefore the odds ratio estimates should be similar to those using binary logistic regression shown in Display 6.4, and they are. However, the p-values are lower in the ordinal logistic regression, as might be expected because information is lost in dichotomizing the outcome.

We can use the estimates to calculate predicted probabilities. For instance, the predicted probability that a male (sex=1) who is receiving sequential therapy (therapy=0) will be in complete remission (outc=4) is (see equation (6.4)):

display .5819366*exp(-0.758662)/(1 + .5819366*exp(-0.758662))

.21415563
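The same arithmetic can be sketched outside Stata. The short Python fragment below is only an illustration; the variable names are ours, and the two numbers are the odds ratio for sex and the cut-point used in the display command above.

```python
import math

# Values copied from the display command in the text.
odds_ratio_sex = 0.5819366   # estimated odds ratio for sex
cut3 = 0.758662              # cut-point between categories 3 and 4

# Odds of complete remission (outc = 4) for sex = 1 and therapy = 0:
# exp(linear predictor - cut-point), with linear predictor log(odds ratio) * 1.
odds = odds_ratio_sex * math.exp(-cut3)

# Convert the odds to a probability.
prob = odds / (1 + odds)
print(round(prob, 6))
```

Writing the calculation as odds first and probability second makes it easy to see how a further odds ratio (for instance for therapy) would simply multiply `odds` before the conversion.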

However, a much quicker way of computing the predicted probabilities for all four responses and all combinations of explanatory variables is to use the predict command:

predict p1 p2 p3 p4

and to tabulate the results as follows:

table sex, contents(mean p1 mean p2 mean p3 mean p4) ///
    by(therapy)

giving the table in Display 6.8.

therapy and sex |   mean(p1)    mean(p2)    mean(p3)    mean(p4)
----------------+------------------------------------------------
alt             |
           male |   .3235821    .3727556    .1713585    .1323038
         female |   .4511651     .346427    .1209076    .0815003

Display 6.8

6.3.2 Diagnosis of heart attacks

The data in sck.dat are read in using

infile ck pres abs using sck.dat, clear

Each observation represents all subjects with maximum creatine kinase values in the same interval, and the variable ck contains the lower limits of the intervals. The total number of subjects is pres+abs, calculated using

generate tot = pres + abs

and the number of subjects with the disease is pres. The probability of pres "successes" in tot trials is binomial with "denominator" tot and probability $\pi_i$, Binomial(tot, $\pi_i$). The programs logit and logistic are for data where each observation represents a single Bernoulli trial, with binomial "denominator" equal to 1, Binomial(1, $\pi_i$). Another command, blogit, can be used to analyze the "grouped" data with "denominators" tot as considered here:

blogit pres tot ck

(see Display 6.9). There is a very significant association between CK and the probability of infarct. (Note that the same coefficient and p-value would be obtained using the mid-point of each interval since all intervals are 40 units wide.)

Logistic regression for grouped data            Number of obs   =        360
                                                LR chi2(1)      =     283.15
                                                Prob > chi2     =     0.0000
Log likelihood = -93.886407                     Pseudo R2       =     0.6013

    _outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ck |   .0351044   .0040812     8.60   0.000     .0271053    .0431035
       _cons |  -2.326272   .2993611    -7.77   0.000    -2.913009   -1.739535

Display 6.9

We now investigate whether it is reasonable to assume that the log odds depends linearly on CK. Therefore, we plot the observed proportions and predicted probabilities as follows:

generate prop = pres/tot

predict pred, pr

label variable prop "observed"

label variable pred "predicted"

twoway (line pred ck) (scatter prop ck), ///
    ytitle("Probability") xtitle(CK)

The predict command gives predicted counts by default and therefore the pr option was used to obtain predicted probabilities instead. In the resulting graph in Figure 6.2, the curve fits the data reasonably well, the largest discrepancy being at CK=280. However, the curve for the predicted probabilities is not smooth. Using the plot-type mspline instead of line produces a smooth curve, but a more faithful smooth curve can be obtained using the graph twoway plot-type function command as follows:

Figure 6.2: Probability of infarct as a function of creatine kinase levels.

twoway (function y=1/(1+exp(-_b[_cons]-_b[ck]*x)), ///
    range(0 480)) (scatter prop ck), ///
    ytitle("Probability") xtitle(CK) ///
    legend(order(1 "predicted" 2 "observed"))

Here we are using the regression coefficients _b[_cons] and _b[ck] to calculate the predicted probability as a function of some hypothetical variable x varying in the range from 0 to 480. This improved graph is shown in Figure 6.3. Note that we could also use the invlogit() function to calculate the inverse logit, i.e., function y = invlogit(_b[_cons]+_b[ck]*x).

Figure 6.3: Smoother version of Figure 6.2.
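To make the inverse logit concrete, the following Python fragment evaluates the fitted curve at a few CK values. It is a sketch under the assumption that the coefficients are those reported for the grouped logistic regression (Display 6.9); the function names are ours and are not part of Stata.

```python
import math

def invlogit(z):
    """Inverse logit: maps a log odds z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficient estimates as reported in the regression table.
b_cons = -2.326272
b_ck = 0.0351044

def predicted_probability(ck):
    """Predicted probability of infarct for a given CK value."""
    return invlogit(b_cons + b_ck * ck)

# Evaluate the smooth curve on a coarse grid over the plotted range.
for ck in range(0, 481, 80):
    print(ck, round(predicted_probability(ck), 4))
```

Because the curve is a function of x alone, it can be evaluated on as fine a grid as desired, which is why the function plot-type is smoother than joining the predictions at the observed CK values.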

We will now plot some residuals and then consider the performance of CK as a diagnostic tool, defining the test as positive if CK exceeds a certain threshold. In particular, we will consider the sensitivity (probability of a positive test result if the disease is present) and specificity (probability of a negative test result if the disease is absent) for different thresholds. There are some useful postestimation commands available for these purposes for use after the logit (or logistic) command that are not available after blogit. We therefore transform the data into the form required for logistic, i.e., one observation per Bernoulli trial with outcome infct equal to 0 or 1 so that the number of ones per CK level equals pres:

expand tot

by ck, sort: generate infct = (_n<=pres)
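The effect of these two commands can be sketched in Python. The grouped records below are hypothetical and are not the values in sck.dat; the loop counter n plays the role of _n within each CK level.

```python
# Hypothetical grouped records (ck, pres, abs); illustrative values only.
grouped = [(0, 2, 88), (40, 13, 26), (80, 30, 8)]

records = []
for ck, pres, absent in grouped:
    tot = pres + absent
    # "expand tot": one record per subject in this CK interval.
    for n in range(1, tot + 1):
        # "generate infct = (_n<=pres)": the first pres records are cases.
        infct = 1 if n <= pres else 0
        records.append((ck, infct))

print(len(records))
```

Which particular records within a CK level are labeled as cases is arbitrary; only the number of ones per level matters for the likelihood.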

We can reproduce the results of blogit using logit:

logit infct ck

(see Display 6.10).

Logistic regression                             Number of obs   =        360
                                                LR chi2(1)      =     283.15
                                                Prob > chi2     =     0.0000
Log likelihood = -93.886407                     Pseudo R2       =     0.6013

       infct |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ck |   .0351044   .0040812     8.60   0.000     .0271053    .0431035
       _cons |  -2.326272   .2993611    -7.77   0.000    -2.913009   -1.739535

Display 6.10

To judge if the discrepancy between the observed and expected proportions is acceptable, we can use standardized Pearson residuals for each "covariate pattern", i.e., for each combination of values in the covariates (here for each value of CK). These residuals may be obtained and plotted as follows:

predict resi, rstandard

twoway scatter resi ck, mlabel(ck)

The graph is shown in Figure 6.4. There are several large outliers. The largest outlier at CK=280 is due to one subject out of 14 not having had an infarct although the predicted probability of an infarct is almost 1. (We could also test the goodness-of-fit of the model; see Exercise 6.3.)

We now determine the accuracy of the diagnostic test based on the logistic regression model. A classification table of the predicted diagnosis (using a cut-off of the predicted probability of 0.5) versus the true diagnosis may be obtained using

estat classif

giving the table shown in Display 6.11. Both the sensitivity and the specificity are relatively high. These characteristics are generally assumed to generalize to other populations whereas the positive and negative predictive values (probabilities of the disease being present/absent if the test is positive/negative) depend on the prevalence (or prior probability) of the condition (see for example Sackett et al., 1991).

Logistic model for infct

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |       215            16  |        231
     -     |        15           114  |        129
-----------+--------------------------+-----------
   Total   |       230           130  |        360

Classified + if predicted Pr(D) >= .5
True D defined as infct != 0

Sensitivity                          93.48%
Specificity                          87.69%
Positive predictive value            93.07%
Negative predictive value            88.37%

False + rate for true ~D             12.31%
False - rate for true D               6.52%
False + rate for classified +         6.93%
False - rate for classified -        11.63%

Display 6.11
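The quantities in the classification table follow directly from the four cell counts. The Python sketch below recomputes them and also illustrates, via Bayes' theorem, why the positive predictive value depends on the prevalence; the function ppv_at_prevalence and the prevalence of 0.1 are our own illustrative choices, not something taken from the text.

```python
# Cell counts from the classification table (rows: classified +/-, columns: true D/~D).
tp, fp = 215, 16    # classified +
fn, tn = 15, 114    # classified -

sensitivity = tp / (tp + fn)    # Pr(+ | D)
specificity = tn / (tn + fp)    # Pr(- | ~D)
ppv = tp / (tp + fp)            # Pr(D | +)
npv = tn / (tn + fn)            # Pr(~D | -)

def ppv_at_prevalence(prev):
    """Positive predictive value for a population with prevalence prev,
    assuming sensitivity and specificity carry over unchanged."""
    true_pos = sensitivity * prev
    false_pos = (1 - specificity) * (1 - prev)
    return true_pos / (true_pos + false_pos)

print(round(100 * sensitivity, 2), round(100 * specificity, 2))
print(round(100 * ppv, 2), round(100 * ppv_at_prevalence(0.1), 2))
```

At the sample prevalence of 230/360 the function returns the tabulated positive predictive value; at a lower prevalence the same test gives a markedly lower value.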

The use of other probability cut-offs could be investigated using the option cutoff(#) in the above command or using the commands lroc to plot a ROC-curve (specificity vs. sensitivity for different cut-offs) or lsens to plot sensitivity and specificity against cut-off (see Exercise 6.3).

The above classification table may be misleading because we are testing the model on the same sample that was used to derive it. An alternative approach is to compute predicted probabilities for each observation from a model fitted to the remaining observations. This method, called the "leave one out" method or jackknifing (see Lachenbruch and Mickey, 1986), can be carried out relatively easily for our data because we only have a small number of covariate and response patterns. Instead of looping through all observations, excluding each observation in the logistic regression command and computing that observation's predicted probability, we can loop through a subset of observations representing all combinations of covariates and responses found in the data.

First, label each unique covariate pattern consecutively in a variable num using predict with the number option:

predict num, number

Now generate first, equal to one for the first observation in each group of unique covariate and response patterns and zero otherwise:

by num infct, sort: generate first = (_n==1)

(We could also have used egen first = tag(num infct).) Now define grp, equal to the cumulative sum of first, obtained using the function sum(). This variable numbers the groups of unique covariate and response patterns consecutively:

generate grp = sum(first)

(An alternative way of generating grp without having to first create first would be to use the command egen grp = group(num infct).) Now determine the number of unique combinations of CK levels and infarct status:

summarize grp

    Variable |       Obs        Mean    Std. Dev.       Min        Max

As there are 20 groups, we need to run logistic 20 times (for each value of grp), excluding one observation from grp to derive the model for predicting the probability for all observations in grp.

First generate a variable, nxt, that consecutively labels the 20 observations to be excluded in turn:

generate nxt = first*grp
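The relationship between first, grp, and nxt may be easier to see on a small example. The Python sketch below uses hypothetical (num, infct) pairs that are assumed to be already sorted, as they would be after the by ..., sort: prefix.

```python
# Hypothetical (num, infct) pairs, sorted by num and then infct.
rows = [(1, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 1)]

first, grp, nxt = [], [], []
running = 0        # cumulative sum of first, as produced by sum(first)
previous = None
for pair in rows:
    is_first = 1 if pair != previous else 0   # first record of a new pattern
    running += is_first
    first.append(is_first)
    grp.append(running)                       # group number of this record
    nxt.append(is_first * running)            # group number on one record per group, else 0
    previous = pair

print(first, grp, nxt)
```

Each group number appears exactly once in nxt, so the condition nxt!=n excludes a single observation from group n, while grp==n selects the whole group.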

Now build up a variable prp of predicted probabilities as follows:

generate prp = 0

forvalues n = 1/20 {
    quietly logistic infct ck if nxt!=`n'
    quietly predict p, pr
    quietly replace prp = p if grp==`n'
    drop p
}

The purpose of these four commands inside the loop is to

1. derive the model excluding one observation from grp,
2. obtain the predicted probabilities p (predict produces results for the whole sample, not just the estimation sample),
3. set prp to the predicted probability for all observations in grp, and
4. drop p so that it can be defined again in the next iteration.

Here the quietly prefix was used to produce no output.

The classification table for the jackknifed probabilities can be obtained using

generate class = (prp>=0.5)

tabulate class infct

           |         infct
     class |         0          1 |     Total
-----------+----------------------+----------
         0 |       114         15 |       129
         1 |        16        215 |       231
-----------+----------------------+----------
     Total |       130        230 |       360

giving the same result as before, although this will not generally be the case.

6.4 Exercises

6.1 Treatment of lung cancer

1. Read in the data without using the expand command, and reproduce the result of ordinal logistic regressions by using the appropriate weights.

6.2 Female psychiatric patients

1. Carry out significance tests for an association between depress and life for the data described in Chapter 2 using
   a. ordinal logistic regression with depress as dependent variable
   b. logistic regression with life as dependent variable.
2. Use stepwise together with logit to find a model for predicting life using the data from Chapter 2 with different sets of candidate variables (see Chapter 3).

6.3 Diagnosis of heart attacks

1. Test the goodness-of-fit of the logistic regression model with ck as the only explanatory variable using a Pearson chi-squared test.
2. To improve the model fit, successively include first a quadratic term of ck in the model, then a cubic term, etc., deciding when to stop using Wald tests for the highest order terms.
3. For the chosen model, repeat the goodness-of-fit test.
4. Produce a graph similar to that in Figure 6.2 for the chosen model.
5. Explore the use of estat classif, cutoff(#), lroc, and lsens for the chosen model.

6.4 Psychiatric screening data

Here we consider data from a study of a psychiatric screening questionnaire called the GHQ (General Health Questionnaire). The data are from Der and Everitt (2002). In addition to completing the questionnaire, subjects were diagnosed as clinically depressed or not by a psychiatrist. Here the question of interest is to determine how the probability of being judged depressed (a "case") is related to sex and the GHQ score.

The variables in screening.dta are:

ghq: GHQ score
sex: sex (F=female, M=male)
cases: number of cases
noncases: number of non-cases

1. Fit both a linear regression and a logistic regression for the probability of being a case with GHQ score as the single explanatory variable.
2. Plot the predicted probabilities from each model against GHQ score on the same diagram and comment on the two curves.
3. Fit a logistic regression model to the probability of being a case using both GHQ score and sex as explanatory variables. Construct a suitable plot to illustrate the model fitted.
4. Investigate whether the previous model can be improved by including a sex x GHQ score interaction.

6.5 Prostate cancer

The data analyzed here arise from a study involving patients with cancer of the prostate (Brown, 1980). The aim was to determine whether a combination of five variables could be used to forecast whether or not the cancer has spread to the lymph nodes, since this forms the basis for the treatment regime that should be adopted. The 53 patients in the study had undergone a laparotomy to determine nodal involvement or not in their case. Here the response variable is binary with zero signifying the absence and unity the presence of nodal involvement.

The variables in prostate.dta are:

id: patient identifier
nodal: nodal involvement (0=no, 1=yes)
age: age of patient at diagnosis (years)
acid: level of serum acid phosphatase (in King-Armstrong units)
xray: result of an X-ray examination (0=negative, 1=positive)
size: size of the tumour as determined by a rectal examination (0=small, 1=large)
grade: summary of the pathological grade of the tumour determined from a biopsy (0=less serious, 1=more serious)

1. Carry out a logistic regression with nodal as the response and age and acid as explanatory variables.
2. Interpret the odds ratios.
3. Carry out a logistic regression for nodal involvement using all five explanatory variables and investigate which of these five variables are most needed in the model. Use both forward and backward selection procedures.

6.6 Satisfaction with housing conditions

Agresti (1984) discusses data on 1681 residents of twelve areas in Copenhagen that allow investigation of the effect of various factors on satisfaction with the housing conditions. (The data are also described in Madsen, 1976.)

The data are in collapsed form with frequencies given in the variable freq. The remaining variables in housing.dta are:

satisfaction: level of satisfaction (1=low, 2=medium, 3=high)
housing: type of housing (1=tower blocks, 2=apartments, 3=atrium houses, 4=terraced houses)
contact: degree of contact with residents (1=low, 2=high)
influence: feeling of influence on apartment management (1=low, 2=medium, 3=high)

1. Fit a proportional odds model for satisfaction treating housing as a categorical predictor and influence as a continuous predictor. Make sure to use frequency weights.
2. Interpret the estimated odds ratios.
3. Discuss the implicit assumption made in treating influence as continuous.
4. Use a likelihood ratio test to decide if influence should be treated as categorical instead.
5. Test for an interaction between housing and contact using both a multivariate Wald test (using testparm) and a likelihood ratio test.

Chapter 7

Generalized Linear Models: Australian School Children

7.1 Description of data

This chapter reanalyzes a number of datasets discussed in previous chapters and, in addition, describes the analysis of a new dataset given in Aitkin (1978). These data come from a sociological study of Australian aboriginal and white children. The sample included children from four age groups (final year in primary school and first three years in secondary school) who were classified as slow or average learners. The number of days absent from school during the school year was recorded for each child. The data are given in Table 7.1. The variables are as follows:

eth: ethnic group (A=aboriginal, N=white)

age: class in school (F0, F1, F2, F3)

lrn: average or slow learner (SL=slow learner, AL=average learner)

days: number of days absent from school in one year

One aim of the analysis is to investigate ethnic differences in the mean number of days absent from school while controlling for the other potential predictors sex, age, and lrn.

7.2 Generalized linear models

Previous chapters have described linear (Chapter 3) and logistic regression (Chapter 6). In this chapter, we will describe a more general class of models, called generalized linear models, of which linear regression and logistic regression are special cases.

Both linear and logistic regression involve a linear combination of the explanatory variables, called the linear predictor, of the form

$$\eta_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} = \boldsymbol{\beta}^T \mathbf{x}_i. \qquad (7.1)$$

In both types of regression, the linear predictor determines the expectation $\mu_i$ of the response variable. In linear regression, where the response is continuous, $\mu_i$ is directly equated with the linear predictor. This is not advisable when the response is dichotomous because in this case the expectation is a probability which must satisfy $0 \le \mu_i \le 1$. In logistic regression, the linear predictor is therefore equated with a function of $\mu_i$, the logit, $\eta_i = \log(\mu_i/(1-\mu_i))$. In generalized linear models, the linear predictor may be equated with any of a number of different functions $g(\mu_i)$ of $\mu_i$, called link functions; that is,

$$\eta_i = g(\mu_i). \qquad (7.2)$$

In linear regression, the probability distribution of the response variable is assumed to be normal with mean $\mu_i$. In logistic regression a binomial distribution is assumed with probability parameter $\mu_i$. Both the normal and binomial distributions come from the same family of distributions, called the exponential family,

$$f(y_i; \theta_i, \phi) = \exp\{(y_i\theta_i - b(\theta_i))/a(\phi) + c(y_i, \phi)\}. \qquad (7.3)$$

For example, for the normal distribution,

$$f(y_i; \theta_i, \phi) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-(y_i - \mu_i)^2/(2\sigma^2)\} = \exp\{(y_i\mu_i - \mu_i^2/2)/\sigma^2 - \tfrac{1}{2}(y_i^2/\sigma^2 + \log(2\pi\sigma^2))\} \qquad (7.4)$$

so that $\theta_i = \mu_i$, $b(\theta_i) = \theta_i^2/2$, $\phi = \sigma^2$, and $a(\phi) = \phi$. The parameter $\theta_i$ can be written as a function of $\mu_i$, and this function is called the canonical link function. The canonical link is frequently chosen as the link function (and is the default link in the Stata command for fitting generalized linear models, glm), although the canonical link is not necessarily more appropriate than any other link. Table 7.2 lists some of the most common distributions used in generalized linear models and their canonical link functions.

Table 7.2 Probability distributions and their canonical link functions

                Variance     Dispersion    Link
Distribution    function     parameter     function    g(mu) = theta(mu)
Normal          1            sigma^2       identity    mu
Binomial        mu(1 - mu)   1             logit       log(mu/(1 - mu))
Poisson         mu           1             log         ln(mu)

The conditional mean and variance of $Y_i$ are given by

$$E(Y_i \mid \mathbf{x}_i) = b'(\theta_i) = \mu_i \qquad (7.5)$$

and

$$\mathrm{var}(Y_i \mid \mathbf{x}_i) = b''(\theta_i)\,a(\phi) = V(\mu_i)\,a(\phi), \qquad (7.6)$$

where $b'(\theta_i)$ and $b''(\theta_i)$ denote the first and second derivatives of $b(\cdot)$ evaluated at $\theta_i$, and the variance function $V(\mu_i)$ is obtained by expressing $b''(\theta_i)$ as a function of $\mu_i$. It can be seen from (7.4) that the variance for the normal distribution is simply $\sigma^2$ regardless of the value of the mean $\mu_i$, i.e., the variance function is 1.

The data on Australian school children will be analyzed by assuming a Poisson distribution for the number of days absent from school. The Poisson distribution is the appropriate distribution of the number of events observed over a period of time, if these events occur independently in continuous time at a constant instantaneous probability rate (or incidence rate); see for example Clayton and Hills (1993). The Poisson distribution is given by

$$f(y_i; \mu_i) = \mu_i^{y_i} e^{-\mu_i}/y_i!, \qquad y_i = 0, 1, 2, \ldots. \qquad (7.7)$$

Taking the logarithm and summing over observations, the log likelihood is given by

$$l(\boldsymbol{\mu}; \mathbf{y}) = \sum_i \{(y_i \ln \mu_i - \mu_i) - \ln(y_i!)\} \qquad (7.8)$$

so that $\theta_i = \ln \mu_i$, $b(\theta_i) = \exp(\theta_i)$, $\phi = 1$, $a(\phi) = 1$, and $\mathrm{var}(Y_i \mid \mathbf{x}_i) = \exp(\theta_i) = \mu_i$. Therefore, the variance of the Poisson distribution is not constant, but equal to the mean. Unlike the normal distribution, the Poisson distribution has no separate parameter for the variance, and the same is true of the binomial distribution. Table 7.2 shows the variance functions and dispersion parameters for some commonly used probability distributions.
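The equality of mean and variance for the Poisson distribution, and the form of its log likelihood, can be checked numerically. The following Python sketch uses an arbitrary, hypothetical mean of 3.5 and hypothetical counts; the function names are ours.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of observing the count y when the mean is mu."""
    return mu ** y * math.exp(-mu) / math.factorial(y)

def poisson_loglik(ys, mus):
    """Log likelihood: sum of y*ln(mu) - mu - ln(y!) over observations."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1)
               for y, mu in zip(ys, mus))

mu = 3.5                      # hypothetical mean, chosen only for illustration
support = range(0, 100)       # probabilities beyond 99 are negligible here

mean = sum(y * poisson_pmf(y, mu) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)
print(round(mean, 6), round(var, 6))

ys = [2, 0, 5, 3]             # hypothetical counts
loglik = poisson_loglik(ys, [mu] * len(ys))
```

The log likelihood function uses lgamma(y + 1) for ln(y!), which avoids forming large factorials when the counts are big.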

Lack of fit may be expressed by the deviance, which is minus twice the difference between the maximized log likelihood of the model and the maximum likelihood achievable, i.e., the maximized likelihood of the full or saturated model. For the normal distribution, the deviance is simply the residual sum of squares. Another measure of lack of fit is the generalized Pearson $X^2$,

$$X^2 = \sum_i (y_i - \hat{\mu}_i)^2 / V(\hat{\mu}_i),$$

which, for the Poisson distribution, is just the familiar Pearson chi-squared statistic for two-way cross-tabulations (since $V(\hat{\mu}_i) = \hat{\mu}_i$). Both the deviance and Pearson $X^2$ have asymptotic $\chi^2$ distributions under the null hypothesis. When the dispersion parameter $\phi$ is fixed (not estimated), an analysis of deviance can be used for comparing nested models. To test the null hypothesis that the restrictions leading to the nested model are true, the difference in deviance between two models is compared with the $\chi^2$ distribution with degrees of freedom equal to the difference in model degrees of freedom.
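As an illustration of the two lack-of-fit measures, the Python sketch below computes the deviance and the Pearson statistic for a Poisson model with a common mean. The counts are hypothetical, and the intercept-only model is our own choice of the simplest possible example.

```python
import math

def poisson_loglik(ys, mus):
    """Poisson log likelihood; y*ln(mu) is taken as 0 when y is 0."""
    total = 0.0
    for y, mu in zip(ys, mus):
        term = y * math.log(mu) if y > 0 else 0.0
        total += term - mu - math.lgamma(y + 1)
    return total

ys = [2, 0, 5, 3, 7]                      # hypothetical counts
mu_hat = sum(ys) / len(ys)                # fitted mean of an intercept-only model
fitted = [mu_hat] * len(ys)

# Deviance: minus twice the difference between the model log likelihood
# and that of the saturated model, whose fitted values equal the data.
deviance = -2 * (poisson_loglik(ys, fitted) - poisson_loglik(ys, ys))

# The same quantity from the closed-form Poisson expression.
deviance_direct = 2 * sum(
    (y * math.log(y / mu) if y > 0 else 0.0) - (y - mu)
    for y, mu in zip(ys, fitted)
)

# Generalized Pearson statistic with variance function V(mu) = mu.
pearson = sum((y - mu) ** 2 / mu for y, mu in zip(ys, fitted))
print(round(deviance, 4), round(pearson, 4))
```

The two statistics are usually of similar size but not identical; both are referred to the same chi-squared distribution when the dispersion parameter is fixed.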

The Pearson and deviance residuals are defined as the (signed) square roots of the contributions of the individual observations to the Pearson $X^2$ and deviance respectively. These residuals may be used to assess the appropriateness of the link and variance functions.

A relatively common phenomenon with count data is overdispersion, i.e., the variance is greater than that of the assumed distribution (binomial with denominator greater than 1 or Poisson). This overdispersion may be due to extra variability in the parameter $\mu_i$ which has not been completely explained by the covariates. One way of addressing the problem is to allow $\mu_i$ to vary randomly according to some distribution and to assume that conditional on $\mu_i$, the response variable follows the binomial (or Poisson) distribution. Such models are called random effects models; see also Chapter 9.

A more pragmatic way of accommodating overdispersion in the model is to assume that the variance is proportional to the variance function, but to estimate the dispersion or scale parameter $\phi$ rather than assuming the value 1 appropriate for the distributions. For the Poisson distribution, the variance is modeled as

$$\mathrm{var}(Y_i \mid \mathbf{x}_i) = \phi\,\mu_i,$$

where $\phi$ is estimated from the deviance or Pearson $X^2$. (This is analogous to the estimation of the residual variance in linear regression models from the residual sums of squares.) This parameter is then used to scale the estimated standard errors of the regression coefficients. This approach of assuming a variance function that does not correspond to any probability distribution is an example of the quasi-likelihood approach. If the variance is not proportional to the variance function, robust standard errors can be used as described in the next section. See McCullagh and Nelder (1989) and Hardin and Hilbe (2006) for more details on generalized linear models.
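A small numerical sketch may help to show how the scale parameter is estimated and used. The Python fragment below again assumes an intercept-only Poisson model and hypothetical counts; dividing the Pearson statistic by the residual degrees of freedom is one of the two estimators mentioned above.

```python
import math

ys = [0, 1, 2, 10, 12, 14]                # hypothetical, visibly overdispersed counts
n = len(ys)
mu_hat = sum(ys) / n                      # fitted mean of an intercept-only model

# Pearson statistic with the Poisson variance function V(mu) = mu.
pearson_x2 = sum((y - mu_hat) ** 2 / mu_hat for y in ys)

# Estimate the dispersion parameter from the Pearson statistic.
resid_df = n - 1                          # one parameter (the intercept) was estimated
phi_hat = pearson_x2 / resid_df

# Model-based standard error of the intercept (log mean) under the Poisson
# assumption, and the same standard error scaled by sqrt(phi_hat).
se_model = math.sqrt(1.0 / (n * mu_hat))
se_scaled = math.sqrt(phi_hat) * se_model
print(round(phi_hat, 3), round(se_model, 4), round(se_scaled, 4))
```

The point estimate of the mean is unchanged; only the standard error is inflated, here by a factor of about the square root of the estimated dispersion.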

7.2.2 Robust standard errors of parameter estimates

A very useful feature of Stata is that robust standard errors of estimated parameters can be obtained for most estimation commands. In maximum likelihood estimation, the standard errors of the estimated parameters are derived from the Hessian (matrix of second derivatives with respect to the parameters) of the log likelihood. However, these standard errors are correct only if the likelihood is the true likelihood of the data. If this assumption is not correct, for instance due to omission of covariates, misspecification of the link function or probability distribution function, we can still use robust estimates of the standard errors known as the Huber, White, or sandwich variance estimates (for details, see Binder, 1983).

In the description of the robust variance estimator in the Stata User's Guide (Section 20.14), it is pointed out that the use of robust standard errors implies a less ambitious interpretation of the parameter estimates and their standard errors than a model-based approach. Instead of assuming that the model is "true" and attempting to estimate "true" parameters, we just consider the properties of the estimator (whatever it may mean) under repeated sampling and define the standard error as its sampling standard deviation.

Another approach to estimating the standard errors without making any distributional assumptions is bootstrapping (Efron and Tibshirani, 1993). If we could obtain repeated samples from the population (from which our data were sampled), we could obtain an empirical sampling distribution of the parameter estimates. In Monte Carlo simulation, the required samples are drawn from the assumed distribution. In bootstrapping, the sample is resampled "to approximate what would happen if the population were sampled" (Manly, 1997). Bootstrapping works as follows. Take a random sample of n observations (n is the sample size), with replacement, and estimate the regression coefficients. Repeat this a number of times to obtain a sample of estimates. From the resulting sample of parameter estimates, obtain the empirical variance-covariance matrix of the parameter estimates. Confidence intervals may be constructed using the estimated variance or directly from the appropriate centiles of the empirical distribution of parameter estimates. See Manly (1997) and Efron and Tibshirani (1993) for more information on the bootstrap.
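The resampling steps just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the data are hypothetical, a simple least-squares slope stands in for the regression coefficients, and the number of replications (500) and the seed are arbitrary.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def bootstrap_slopes(xs, ys, reps=500, seed=12345):
    """Slopes from reps samples of size n drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:          # slope undefined for this sample; draw again
            continue
        by = [ys[i] for i in idx]
        slopes.append(ols_slope(bx, by))
    return slopes

# Hypothetical data, roughly linear with slope 2.
xs = list(range(1, 13))
ys = [2.1, 3.9, 6.2, 8.1, 9.7, 12.3, 13.8, 16.1, 18.2, 19.8, 22.1, 24.0]

slope_hat = ols_slope(xs, ys)
slopes = sorted(bootstrap_slopes(xs, ys))

boot_se = statistics.stdev(slopes)                 # bootstrap standard error
ci_low = slopes[int(0.025 * len(slopes))]          # percentile interval, lower limit
ci_high = slopes[int(0.975 * len(slopes)) - 1]     # percentile interval, upper limit
print(round(slope_hat, 4), round(boot_se, 4))
```

Whole observations (pairs of x and y) are resampled, so no assumption is made about the distribution of the residuals.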

7.3 Analysis using Stata

The glm command can be used to fit generalized linear models. The syntax is analogous to logit and regress except that the options family() and link() are used to specify the probability distribution of the response and the link function, respectively. We first analyze data from the previous chapters to show how linear regression, ANOVA, and logistic regression are performed using glm and then move on to the data on Australian school children.

7.3.1 Linear regression

First, we show how linear regression can be carried out using glm. In Chapter 3, the U.S. air-pollution data were read in using the instructions

infile str10 town so2 temp manuf pop wind precip days ///
    using usair.dat, clear

drop if town=="Chicago"

and now we regress so2 on a number of variables using

glm so2 temp pop wind precip, family(gaussian) link(identity)

(see Display 7.1). The results are identical to those of the regression analysis. The scale parameter given on the right-hand side above the regression table represents the residual variance given under Residual MS in the analysis of variance table of the regression analysis in Chapter 3. We can estimate robust standard errors using the vce(robust) option

glm so2 temp pop wind precip, family(gauss) ///
    link(identity) vce(robust)

Generalized linear models                    No. of obs      =         40
Optimization     : ML                        Residual df     =         35
                                             Scale parameter =   290.0043
Deviance         =  10150.15199              (1/df) Deviance =   290.0043
Pearson          =  10150.15199              (1/df) Pearson  =   290.0043

Variance function: V(u) = 1                  [Gaussian]
Link function    : g(u) = u                  [Identity]

                                             AIC             =   8.624242
Log likelihood   = -167.4848314              BIC             =   10021.04

Display 7.1

(see Display 7.2) giving slightly different standard errors, suggesting that some assumptions may not be entirely satisfied.

We now show how an analysis of variance model can be fitted using glm, using the slimming clinic example of Chapter 5. The data are read using

infile cond status resp using slim.dat, clear

and the full, saturated model can be obtained using

xi: glm resp i.cond*i.status, family(gaussian) link(identity)

(see Display 7.3). This result is identical to that obtained using the command

xi: regress resp i.cond*i.status

(see exercises in Chapter 5). We can obtain the F-statistics for the interaction term by saving the deviance of the above model (residual sum of squares) in a local macro and refitting the model with the interaction removed:

local dev1 = e(deviance)

xi: glm resp i.cond i.status, family(gaussian) link(identity)

Generalized linear models                    No. of obs      =         40
Optimization     : ML                        Residual df     =         35
                                             Scale parameter =   290.0043
                                             (1/df) Deviance =   290.0043
                                             (1/df) Pearson  =   290.0043

Variance function: V(u) = 1
Link function    : g(u) = u

Display 7.2

i.cond            _Icond_1-2          (naturally coded; _Icond_1 omitted)
i.status          _Istatus_1-2        (naturally coded; _Istatus_1 omitted)
i.cond*i.status   _IconXsta_#_#       (coded as above)

                                             No. of obs      =         34
                                             Residual df     =         30
                                             Scale parameter =    35.9616
                                             (1/df) Deviance =    35.9616
                                             (1/df) Pearson  =    35.9616

                                             [Gaussian]
                                             [Identity]

                                             AIC             =    6.53046
                                             BIC             =   973.0573

Display 7.3

i.cond            _Icond_1-2          (naturally coded; _Icond_1 omitted)
i.status          _Istatus_1-2        (naturally coded; _Istatus_1 omitted)

Generalized linear models                    No. of obs      =         34
Optimization     : ML                        Residual df     =         31
                                             Scale parameter =   34.80576
Deviance         =  1078.97844               (1/df) Deviance =   34.80576
Pearson          =  1078.97844               (1/df) Pearson  =   34.80576

Variance function: V(u) = 1                  [Gaussian]
Link function    : g(u) = u                  [Identity]

                                             AIC             =   6.471757
                                             BIC             =   969.6613

             |                 OIM
        resp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Icond_2 |   .5352102   2.163277     0.25   0.805    -3.704734    4.775154
  _Istatus_2 |   5.989762   2.167029     2.76   0.006     1.742463    10.23706
       _cons |  -7.199832   2.094584    -3.44   0.001    -11.30514   -3.094524

Display 7.4

(see Display 7.4). The increase in deviance caused by the removal of the interaction term represents the sum of squares of the interaction term after eliminating the main effects:

local dev0 = e(deviance)

local ddev = `dev0'-`dev1'

display `ddev'

.13031826

and the F-statistic is simply the mean sum of squares of the interaction term after eliminating the main effects divided by the residual mean square of the full model. The numerator and denominator degrees of freedom are 1 and 30 respectively, so that F and the associated p-value may be obtained as follows:

local f = (`ddev'/1)/(`dev1'/30)

display `f'

.00362382

display Ftail(1,30,`f')

.95239704

The general method for testing the difference in fit of two nested generalized linear models, using the difference in deviance, is not appropriate here because the scale parameter φ = σ² was estimated. Note that the z-test in the regression table in Display 7.3, as well as the chi-squared test performed by testparm _IconX*, assume sampling distributions that are appropriate if the dispersion parameter is known or for large residual degrees of freedom. Here the p-value from the F-test is identical to three decimal places to that from the z-test.
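The F-test above hard-codes the degrees of freedom 1 and 30. A minimal sketch of the same calculation that reads them from the fitted models is given below. It assumes that the slimming data (resp, cond, status) are in memory, that the full model was specified as i.cond*i.status (the command itself is not shown in this section), and that glm leaves the residual degrees of freedom in e(df); the macro names df0 and df1 are introduced here for illustration only.

```stata
* Sketch only: full model with interaction (assumed specification)
quietly xi: glm resp i.cond*i.status, family(gaussian) link(identity)
local dev1 = e(deviance)
local df1  = e(df)

* Reduced model with main effects only
quietly xi: glm resp i.cond i.status, family(gaussian) link(identity)
local dev0 = e(deviance)
local df0  = e(df)

* F = (change in deviance / change in df) / residual mean square of full model
local f = ((`dev0'-`dev1')/(`df0'-`df1'))/(`dev1'/`df1')
display "F = " `f' "   p = " Ftail(`df0'-`df1', `df1', `f')
```

Written this way, the same lines can be reused when the two models differ by more than one parameter.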

We now repeat the logistic regression analysis of Chapter 6 using glm. We first read the tumor data as before, without replicating records.

infile fr1 fr2 fr3 fr4 using tumor.dat, clear
gen therapy = int((_n-1)/2)
sort therapy
by therapy: gen sex = _n
reshape long fr, i(therapy sex) j(outc)
gen improve = outc
recode improve 1/2=0 3/4=1
list

       therapy   sex   outc   fr   improve
  1.         0     1      1   28         0
  2.         0     1      2   45         0
  3.         0     1      3   29         1
  4.         0     1      4   26         1
  5.         0     2      1    4         0
  6.         0     2      2   12         0
  7.         0     2      3    5         1
  8.         0     2      4    2         1
  9.         1     1      1   41         0
 10.         1     1      2   44         0
 11.         1     1      3   20         1
 12.         1     1      4   20         1
 13.         1     2      1   12         0
 14.         1     2      2    7         0
 15.         1     2      3    3         1
 16.         1     2      4    1         1

The glm command can be used with the logit link and binomial distribution and with fr as frequency weights using

glm improve therapy sex [fweight=fr], family(binomial) ///
    link(logit)

(see Display 7.5). The likelihood ratio test for sex can be obtained as follows:

local dev1 = e(deviance)

Generalized linear models                 No. of obs      =       299
Optimization     : ML                     Residual df     =       296
                                          Scale parameter =         1
Deviance         = 381.2634298            (1/df) Deviance =  1.288052
Pearson          = 298.7046083            (1/df) Pearson  =  1.009137

Variance function: V(u) = u*(1-u)         [Bernoulli]
Link function    : g(u) = ln(u/(1-u))     [Logit]

                                          AIC             =  1.295195
Log likelihood   = -190.6317149           BIC             = -1306.068

               OIM
 improve      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
 therapy  -.5022014   .2456898   -2.04    0.041   -.9837445   -.0206582
     sex  -.6543125   .3714739   -1.76    0.078   -1.382388    .0737631
   _cons   .3858096   .4514173    0.85    0.393   -.4989519    1.270571

Display 7.5

quietly glm improve therapy [fweight=fr], ///
    family(binomial) link(logit)

local dev0 = e(deviance)
display `dev0'-`dev1'

3.3469816

display chi2tail(1,`dev0'-`dev1')

.0673

which gives the same result as in Section 6.3.1 where we used estimates store and lrtest.
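For comparison, the estimates store and lrtest route mentioned above might look as follows when applied to the glm fits. This is a sketch rather than the code of Section 6.3.1: it assumes the long-format tumor data (improve, therapy, sex, fr) are in memory and that lrtest accepts maximum likelihood glm fits with frequency weights; the name full is arbitrary.

```stata
* Sketch: likelihood ratio test for sex via stored estimates
quietly glm improve therapy sex [fweight=fr], family(binomial) link(logit)
estimates store full
quietly glm improve therapy [fweight=fr], family(binomial) link(logit)
lrtest full .

* The same p-value by hand from the printed deviance difference
display chi2tail(1, 3.3469816)
```

Because the scale parameter is fixed at 1 for the binomial family, the difference in deviance is itself the likelihood ratio statistic, which is why the two routes agree.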

7.3.4 Australian school children

We now move on to analyze the data in Table 7.1 to investigate differences between aboriginal and white children in the mean number of days absent from school after controlling for other covariates. The data are available as a Stata file quine.dta and may therefore be read simply by using the command

use quine, clear

The variables are of type string and can be converted to numeric using the encode command as follows:

encode eth, gen(ethnic)
drop eth
encode sex, gen(gender)
drop sex
encode age, gen(class)
drop age
encode lrn, gen(slow)
drop lrn
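Because the later models treat class as a numeric score and interpret the sign of the ethnic coefficient, it can help to confirm which number encode assigned to each string. The lines below are a small sketch of such a check; they assume only the four variables just created and rely on encode attaching a value label with the same name as the new variable, which is its default behavior.

```stata
* Sketch: show the numeric codes behind the value labels created by encode
label list ethnic gender class slow

* Frequencies by underlying numeric code rather than by label
tabulate ethnic, nolabel
tabulate class, nolabel
```

encode numbers the distinct strings in alphabetical order starting at 1, so the codes follow the sorted order of the original string values.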

The number of children in each of the combinations of categories of gender, class, and slow can be found using

table slow class ethnic, contents(freq) by(gender)

Display 7.6 (cell frequencies of slow by class and ethnic within gender; the counts are not recoverable from this transcript)

(see Display 7.6). This reveals that there were no "slow learners" in class F3. A table of the means and standard deviations is obtained using

table slow class ethnic, contents(mean days sd days) ///
    by(gender) format(%4.1f)

(see Display 7.7), where the format() option causes only a single decimal place to be given. This table suggests that the variance associated with the Poisson distribution is not appropriate here as squaring the standard deviations (to get the variances) results in values that are greater than the means, i.e., there is overdispersion. In this case, the overdispersion probably arises from substantial variability in children's underlying tendency to miss days of school that cannot be fully explained by the variables we have included in the model.

Ignoring the problem of overdispersion for the moment, a generalized linear model with a Poisson family and log link can be fitted using

glm days slow class ethnic gender, family(poisson) link(log)

gender                          ethnic and class
and slow    ------------- A -------------    ------------- N -------------
              F0      F1      F2      F3       F0      F1      F2      F3

F   AL      21.3    11.4     2.0    14.6     18.5    11.0     1.0    13.5
            17.7     8.5            14.9     10.7     8.9            11.5

    SL       3.0    22.6    36.4             25.0     6.0     6.2
                    18.7    26.6                      4.2     6.0

M   AL      13.0    10.6    27.4    27.1      5.3     3.6     9.1    27.3
             8.0     4.9    14.7    10.4      6.4     0.7     9.5    22.9

    SL       9.0     9.0    37.0             30.0     8.1    29.3
             6.2     6.2    23.4             32.6     6.1     7.0

Display 7.7

Iteration 0:   log likelihood = -1192.3347
Iteration 1:   log likelihood = -1178.6003
Iteration 2:   log likelihood = -1178.5612
Iteration 3:   log likelihood = -1178.5612

Generalized linear models                 No. of obs      =       146
Optimization     : ML                     Residual df     =       141
                                          Scale parameter =         1
Deviance         =  1768.64529            (1/df) Deviance =  12.54358
Pearson          = 1990.142857            (1/df) Pearson  =  14.11449

Variance function: V(u) = u               [Poisson]
Link function    : g(u) = ln(u)           [Log]

                                          AIC             =  16.21317
Log likelihood   = -1178.561184           BIC             =  1065.957

               OIM
    days      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]

(see Display 7.8). The algorithm takes three iterations to converge to the maximum likelihood (or minimum deviance) solution. In the absence of overdispersion, the scale parameters based on the Pearson X² or the deviance should be close to 1. The values of 14.1 and 12.5 (shown at the top-right), respectively, therefore indicate that there is overdispersion. Consequently, the confidence intervals are likely to be too narrow. McCullagh and Nelder (1989) use the Pearson X² divided by the degrees of freedom to estimate the scale parameter for the quasi-likelihood method for Poisson models. This may be achieved using the option scale(x2):

glm days slow class ethnic gender, family(poisson) ///
    link(log) scale(x2)

Generalized linear models                 No. of obs      =       146
Optimization     : ML                     Residual df     =       141
                                          Scale parameter =         1
Deviance         =  1768.64529            (1/df) Deviance =  12.54358
Pearson          = 1990.142857            (1/df) Pearson  =  14.11449

Variance function: V(u) = u               [Poisson]
Link function    : g(u) = ln(u)           [Log]

                                          AIC             =  16.21317
                                          BIC             =  1065.957

               OIM
    days      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]

(Standard errors scaled using square root of Pearson X2-based dispersion)

Display 7.9

(see Display 7.9). Allowing for overdispersion has no effect on the regression coefficients, but a large effect on the p-values and confidence intervals so that gender and slow are now no longer significant at the 5% level. These terms will be removed from the model. The coefficients can be interpreted as the differences in the logs of the predicted mean counts between groups after controlling for the other variables. For example, the log of the predicted mean number of days absent from school for white children is 0.55 lower than that for aboriginals after controlling for slow, class, and gender. Exponentiating the coefficients yields ratios of expected counts (or rate ratios). The glm command exponentiates all coefficients and confidence intervals when the eform option is used:

glm days class ethnic, family(poisson) link(log) ///
    scale(x2) eform

Generalized linear models                 No. of obs      =       146
Optimization     : ML                     Residual df     =       143
                                          Scale parameter =         1
                                          (1/df) Deviance =  12.75162
                                          (1/df) Pearson  =  14.62446

Variance function: V(u) = u               [Poisson]
Link function    : g(u) = ln(u)           [Log]

                                          AIC             =  16.56136
Log likelihood   = -1205.979185           BIC             =  1110.826

               OIM
    days        IRR   Std. Err.      z    P>|z|    [95% Conf. Interval]
   class   1.177895   .0895256    2.15    0.031    1.014872    1.367105
  ethnic   .5782531   .0924981   -3.42    0.001    .4226284    .7911836

(Standard errors scaled using square root of Pearson X2-based dispersion)

Display 7.10

(see Display 7.10). Therefore, white children are absent from school about 58% as often as aboriginal children (95% confidence interval from 42% to 79%) after controlling for class.

We have treated class as a continuous covariate. This implies that the log rate ratio for two categories is a constant multiple of the difference in scores assigned to these categories; for example the rate ratio comparing classes F1 and F0 is the same as that comparing F2 and F1. To see whether this appears to be appropriate, we can form the square of class and include this in the model:

gen class2 = class^2
glm days class class2 ethnic, family(poisson) link(log) ///
    scale(x2) eform

Generalized linear models                 No. of obs      =       146
Optimization     : ML                     Residual df     =       142
                                          Scale parameter =         1
Deviance         = 1822.560172            (1/df) Deviance =  12.83493
Pearson          = 2081.259436            (1/df) Pearson  =  14.65676

Variance function: V(u) = u               [Poisson]
Link function    : g(u) = ln(u)           [Log]

                                          AIC             =  16.56875
Log likelihood   = -1205.518626           BIC             =  1114.888

               OIM
    days        IRR   Std. Err.      z    P>|z|    [95% Conf. Interval]
   class   1.059399   .4543011    0.13    0.893    .4571295    2.455158
  class2   1.020512   .0825501    0.25    0.802    .8708906    1.195839
  ethnic   .5784944    .092643   -3.42    0.001    .4226525    .7917989

(Standard errors scaled using square root of Pearson X2-based dispersion)

Display 7.11

(see Display 7.11). This term is not significant at the 5% level so we can return to the simpler model. (Note that the interaction between class and ethnic is also not significant, see exercises.)

We now look at the residuals for this model. The post-estimation command predict that was used for regress and logistic can be used here as well. To obtain standardized Pearson residuals, use the pearson option with predict and divide the residuals by the square root of the estimated dispersion parameter stored in e(dispersp_ps):

quietly glm days class ethnic, family(poisson) link(log) ///
    scale(x2)

predict resp, pearson
gen stres = resp/sqrt(e(dispersp_ps))

The residuals are plotted against the linear predictor using

predict xb, xb
twoway scatter stres xb, ytitle("Standardized Residuals")

with the result shown in Figure 7.1. There is one large outlier with a standardized Pearson residual greater than 4. In order to find out which observation this is, we list a number of variables for cases with large standardized Pearson residuals:

predict mu, mu
list stres days mu ethnic class if stres>2|stres<-2

(see Display 7.12). Case 72, a white primary school child, has a very large residual.
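One detail of the list command above is worth a hedged note: Stata treats missing values as larger than any number, so the condition stres>2 would also select observations whose residual is missing. A slightly more defensive sketch is shown below. The variable name case is introduced here for illustration, and it identifies observations correctly only if the data are still in their original order.

```stata
* Sketch: record the observation number so that outlying cases can be named
generate case = _n

* List large residuals in either direction, excluding missing residuals
list case stres days mu ethnic class if abs(stres) > 2 & stres < .
```

With complete data this returns the same rows as the original command, with the case number shown alongside.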

We now also check the assumptions of the model by using robust standard errors:

glm days class ethnic, family(poisson) link(log) vce(robust)

Figure 7.1: Standardized residuals against linear predictor.

Display 7.12

Generalized linear models                 No. of obs      =       146
Optimization     : ML                     Residual df     =       143
                                          Scale parameter =         1
Deviance         = 1823.481292            (1/df) Deviance =  12.75162
Pearson          =  2091.29704            (1/df) Pearson  =  14.62446

Variance function: V(u) = u               [Poisson]
Link function    : g(u) = ln(u)           [Log]

                                          AIC             =  16.56136
                                          BIC             =  1110.826

    days      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
   class   .1637288   .0766153    2.14    0.033    .0135655     .313892
  ethnic  -.5477436   .1585381   -3.45    0.001   -.8584725   -.2370147
   _cons   3.168776   .3065466   10.34    0.000    2.567956    3.769597

Display 7.13

(see Display 7.13) giving almost exactly the same p-values as the quasi-likelihood solution:

glm days class ethnic, family(poisson) link(log) scale(x2)

(see Display 7.14). We can also use bootstrapping via the bootstrap prefix to obtain alternative robust standard errors. Since bootstrapping involves random sampling, we first set the seed of the pseudo-random number generator using the set seed command so that we can run the sequence of commands again in the future and obtain the same results.

set seed 12345678

For the bootstrap prefix, the statistics are specified for which standard errors are required, here _b[class] and _b[ethnic], followed by a comma and any bootstrap options, here reps(500) to use 500 replicates. Finally the estimation command is specified after a colon.

bootstrap _b[class] _b[ethnic], reps(500): ///
    glm days class ethnic, family(poisson) link(log)

Generalized linear models                 No. of obs      =       146
Optimization     : ML                     Residual df     =       143
                                          Scale parameter =         1
Deviance         = 1823.481292            (1/df) Deviance =  12.75162
Pearson          =  2091.29704            (1/df) Pearson  =  14.62446

Variance function: V(u) = u               [Poisson]
Link function    : g(u) = ln(u)           [Log]

                                          AIC             =  16.56136
Log likelihood   = -1205.979185           BIC             =  1110.826

    days      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
   class   .1637288   .0760048    2.15    0.031    .0147622    .3126954
  ethnic  -.5477436   .1599613   -3.42    0.001    -.861262   -.2342252
   _cons   3.168776   .3170159   10.00    0.000    2.547437    3.790116

(Standard errors scaled using square root of Pearson X2-based dispersion)

Display 7.14

Bootstrap replications (500)

Bootstrap results                         Number of obs   =       146
                                          Replications    =       500

      command:  glm days class ethnic, family(poisson) link(log)
        _bs_1:  _b[class]
        _bs_2:  _b[ethnic]

           Observed   Bootstrap                         Normal-based
              Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
   _bs_1   .1637288   .0708072    2.31    0.021    .0249493    .3025083
   _bs_2  -.5477436   .1462108   -3.75    0.000   -.8343115   -.2611757

Display 7.15

(see Display 7.15). The standard errors compare quite well with those using the vce(robust) option or using the quasi-likelihood approach.

We could also model overdispersion by assuming a random effects model where each child has an unobserved, random proneness to be absent from school. This proneness (called frailty in a medical context) multiplies the rate predicted by the covariates so that some children have higher or lower rates of absence from school than other children with the same covariates. The observed counts are assumed to have a Poisson distribution conditional on the random effects. If the frailties are assumed to have a gamma distribution, then the (marginal) distribution of the counts has a negative binomial distribution. The negative binomial model can be fitted using nbreg as follows:

nbreg days c l a s s ethnic

(see Display 7.16). Alternatively, the same model can be estimated using glm with family(nbinomial).

                                          Number of obs   =       146
                                          LR chi2(2)      =     15.77
                                          Prob > chi2     =    0.0004
                                          Pseudo R2       =    0.0141

    days      Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
   class   .1505165   .0732832    2.05    0.040    .0068841    .2941489
  ethnic  -.5414185   .1578378   -3.43    0.001   -.8507748   -.2320622
   _cons    3.19392   .3217681    9.93    0.000    2.563266    3.824574

/lnalpha  -.1759664   .1243878                    -.4197619    .0678292

   alpha   .8386462   .1043173                     .6572032    1.070182

Likelihood-ratio test of alpha=0:  chibar2(01) = 1309.47   Prob>=chibar2 = 0.000

Display 7.16

All four methods of analyzing the data lead to the same conclusion. The Poisson model is a special case of the negative binomial model with α = 0. The likelihood ratio test for α is therefore a test of the negative binomial against the Poisson distribution. The very small p-value "against Poisson" indicates that there is significant overdispersion. (Note that, as indicated by the expression chibar2(01), the test is based on the correct sampling distribution taking into account that the null hypothesis is on the boundary of the parameter space, see, e.g., Snijders and Bosker, 1999.)
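Two of the quantities in the negative binomial output can be checked by hand, which may help in reading the display. The sketch below uses only figures printed in Display 7.16 and assumes the usual interpretation of chibar2(01) as an equal mixture of a point mass at zero and a chi-squared distribution with one degree of freedom, so that the p-value is half the ordinary chi-squared tail probability.

```stata
* alpha is the exponential of the estimated /lnalpha
display exp(-.1759664)

* Confidence limits for alpha are the exponentiated limits for /lnalpha
display exp(-.4197619)
display exp(.0678292)

* Boundary-corrected p-value for the test of alpha = 0
display 0.5*chi2tail(1, 1309.47)
```

The last line produces a number that is zero to any printable precision, in line with the reported Prob>=chibar2 of 0.000.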

7.4 Exercises

7.1 Effectiveness of slimming clinics

1. Calculate the F-statistic and difference in deviance for adding status to a model already containing cond for the data in slim.dat.

2. Fit the model using status as the only explanatory variable, using robust standard errors. How does this compare with a t-test with unequal variances?

7.2 Australian school children

1. Carry out a significance test for the interaction between class and ethnic for the data in quine.dta.

2. Excluding the potential outlier (case 72), fit the model with explanatory variables ethnic and class.

3. Dichotomize days absent from school by classifying 14 days or more as frequently absent. Analyze this new response using the glm command with both logit and probit links and the binomial family.

4. Repeat the analyses with the vce(robust) option, and compare the robust standard errors with the standard errors obtained using bootstrapping.

See also the Exercises in Chapter 11.

7.3 Wave damage to cargo ships

McCullagh and Nelder (1989) describe data provided by J. Crilley and L. N. Hemingway of Lloyd's Register of Shipping concerning the damage caused by waves to the forward section of certain cargo ships. The data are in the form of a table giving the total number of damage incidents by three factors: (1) ship type, (2) year of construction, and (3) period of operation. (For further discussion of this kind of aggregated data, see the next chapter.) The total number of months in service for each ship type is also given. The purpose of the analysis is to investigate the risk of damage associated with the three factors.

The variables in the dataset ships.dta are:

• damage: total number of damage incidents
• type: ship type (A, B, C, D, or E)
• construction: year of construction (1960-64, 1965-69, 1970-74, 1975-79)
• operation: period of operation (1960-74 or 1975-79)
• months: aggregate number of months in service

Note that type, construction, and operation are string variables.

1. McCullagh and Nelder (1989) consider a log-linear Poisson model with main effects for type, construction and service and with the logarithm of months as an offset (a covariate with regression coefficient set to 1). Fit this model using the glm command (see option offset()).

2. Repeat the analysis above by relaxing the assumption that φ = 1.

3. Obtain exponentiated regression coefficients and interpret them.

4. Derive scaled Pearson residuals and discuss if there are any potential outliers.

5. Consider including an interaction between ship type and year of construction. Note that an F-test should be used as demonstrated for a linear model in Section 7.3.2.

7.4 Clotting times of blood

Here we consider data originally published by Hurn et al. (1945) and provided by McCullagh and Nelder (1989). Normal plasma was diluted to nine different concentrations with prothrombin-free plasma and clotting was induced with two different lots of

The variables in clotting.dta are:

• lot: lot number (1 or 2)
• conc: concentration of prothrombin-free plasma (in percent)
• time: clotting time

1. Following McCullagh and Nelder, use a log transformation of conc and specify a gamma distribution and a reciprocal link function 1/μ_i = η_i (use the link(reciprocal) option). Fit the following sequence of models for the clotting times: (1) without covariates, (2) with a main effect of log concentration, (3) with main effects of log concentration and lot, and (4) with main effects of log concentration and lot and their interaction. Use F-tests as shown in Section 7.3.2 to decide which is the best-fitting model.

2. Calculate predicted mean clotting times and plot them versus concentration, using different line styles for the two lots. Also show the observed clotting times as points on the same graph.

Chapter 8

Summary Measure Analysis of Longitudinal Data: Treatment of Post-Natal Depression

8.1 Description of data

The dataset to be analyzed in this chapter originates from a clinical trial of the use of estrogen patches in the treatment of postnatal depression; full details are given in Gregoire et al. (1996). In total, 61 women with major depression, which began within 3 months of childbirth and persisted for up to 18 months postnatally, were allocated randomly to the active treatment or a placebo (a dummy patch); 34 received the former and the remaining 27 received the latter. The women were assessed pretreatment and monthly for six months after treatment on the Edinburgh postnatal depression scale (EPDS), higher values of which indicate increasingly severe depression. The data are shown in Table 8.1; a value of -9 in this table indicates that the observation is missing. The non-integer depression scores result from missing questionnaire items (in this case the average of all available items was multiplied by the total number of items). The variables are

• group: treatment group (1=estrogen patch, 0=placebo patch)
• pre: pretreatment or baseline EPDS depression score
• dep1 to dep6: EPDS depression score for visits 1 to 6

The main question of interest here is whether the estrogen patch is effective at reducing post-natal depression compared with the placebo.

Table 8.1 Data in depress.dat

subj group pre dep1 dep2 dep3 dep4 dep5 dep6

I 0 18 17 18 1'1 17 18 15 2 0 27 3t2 '13 1 R 1 7 12 10


Data in depress.dat (continued) 51 I 25 15 24 18 5 . 1 I:< 12.32 72 1 1R 17 6 '1 2 t1 1 -53 1 2fi 1 18 10 13 12 10 5.1 1 2W 27 I.? 9 8 4 > 55 t I7 20 1U 11.89 8-49 7.02 8.m 56 1 22 12 -9 -9 -1 -0 -4 J i 1 22 15.:IR 2 4 li 3 8 58 1 23 1 1 9 10 S 7 4 59 1 1 7 15 -g -9 -g -9 -9

6U 1 22 7 12 15 - 9 -!I -9 ( r l 1 26 u -9 -9 -n -!1 -9

8.2 The analysis of longitudinal data

The data in Table 8.1 consist of repeated observations over time on each of the 61 patients; such data are generally referred to as longitudinal data, panel data or repeated measurements, and as cross-sectional time-series in Stata. There is a large body of methods that can be used to analyze longitudinal data, ranging from the simple to the complex. Some useful references are Diggle et al. (2002), Everitt (1995), and Hand and Crowder (1996). In this chapter we concentrate on the following approaches:

• Graphical displays
• Summary measure or response feature analysis

In the next two chapters, more formal modeling techniques will be applied to the data.

8.3 Analysis using Stata

Assuming the data are in an ASCII file, depress.dat, as listed in Table 8.1, they may be read into Stata for analysis using the following instructions:

infile subj group pre dep1 dep2 dep3 dep4 dep5 dep6 ///
    using depress.dat, clear

mvdecode _all, mv(-9)

The second of these instructions converts values of -9 in the data to missing values.
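A quick way to confirm that the recoding worked is to count the missing depression scores for each woman. The sketch below assumes the wide-format data just read are in memory; nmiss is a temporary variable name introduced here for illustration.

```stata
* Sketch: number of missing post-treatment scores per woman
egen nmiss = rowmiss(dep1-dep6)

* Cross-tabulate the counts against treatment group
tabulate nmiss group

* Remove the helper variable again
drop nmiss
```

Women with a count of zero have complete follow-up, and any remaining values of -9 can be found with a command such as count if dep6 == -9.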

It is useful to begin examination of these data using the summarize command to calculate means, variances, etc., within each of the two treatment groups:

summarize pre-dep6 if group==0

(see Display 8.1).

Variable Hean S t d . Dsv.

dep2 16.53818 6.124177 dep3 14.12882 e.974648 4.19 22 d e F 12.27471 6 848791 23

dep6 17 11.40294 4.438702 3.03 18 depb 17 10.89588 4 68157 3 45 20

Display 8.1

summarize pre-dep6 if group==1

Variable        Mean   Std. Dev.

11.73677 6.575079 29 9.13U38 5.475564 28 8.827857 4.666663 0 22

28 7.309286 5.740988 0 24 28 6.590714 4.730158 1 23

(see Display 8.2). There is a general decline in the depression score over time in both groups, with the values in the active treatment group appearing to be consistently lower.

8.3.2 Graphical displays

A useful preliminary step in the analysis of longitudinal data is to graph the observations in some way. The aim is to highlight two particular aspects of the data, namely, how they evolve over time and how the measurements made at different times are related. A number of graphical displays can be used, including:

• separate plots of each subject's responses against time, differentiating in some way between subjects in different groups
• boxplots of the observations at each time point by treatment group
• a plot of means and standard errors by treatment group for every time point
• a scatterplot matrix of the repeated measurements

To begin, plot the required scatterplot matrix, identifying treatment groups with the labels 0 and 1, using

graph matrix pre-dep6, mlabel(group) msymbol(none) ///
    mlabposition(0)

The resulting plot is shown in Figure 8.1. The most obvious feature of this diagram is the increasingly strong relationship between the measurements of depression as the time interval between them decreases. This has important implications for the models appropriate for longitudinal data, as we will see in Chapter 10.

Figure 8.1: Scatter-plot matrix for depression scores at six visits.

To obtain the other graphs mentioned above, the dataset needs to be restructured from its present wide form (one column for each visit) to the long form (one row for each visit) using the reshape command. Before running reshape, we will preserve the data using the preserve command so that they can later be restored using restore:

preserve
reshape long dep, i(subj) j(visit)
list in 1/13, clean

The first 13 observations of the data in long form are shown in Display 8.3.

Display 8.3
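The effect of reshape long is easiest to see on a very small dataset. The toy example below is a sketch with made-up values (they are not taken from depress.dat) and only three visits. It begins with clear, so it should be run in a separate Stata session rather than in the middle of the analysis.

```stata
* Toy wide-format data: one row per subject, one column per visit
clear
input subj group dep1 dep2 dep3
1 0 18 17 15
2 1 12  .  9
end

* Long format: one row per subject and visit; visit is taken from the suffix
reshape long dep, i(subj) j(visit)
list, clean

* Typing reshape wide without arguments undoes the last reshape
reshape wide
list, clean
```

After reshape long the two subjects occupy six rows, with the missing score for subject 2 at visit 2 kept as a row whose dep is missing.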

To inspect the patterns of missing values in this dataset, we first delete observations where dep is missing and then use the xtdes command:

drop if dep==.
xtdes, i(subj) t(visit)

giving the output shown in Display 8.4. We see that 45 subjects have complete data, 8 subjects dropped out after the first visit, 7 after the second, and 1 after the third. This kind of missingness pattern is called "monotonic" because people never return once they have missed a visit.
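The monotone pattern can also be verified directly. The following sketch assumes the long-format data with missing rows already dropped, as above; the variable names nvis, lastvis, and first are introduced here for illustration. It relies on the fact that, when nobody returns after a missed visit, each woman's last remaining visit number equals her number of remaining visits.

```stata
* Sketch: count remaining visits and find the last visit for each subject
sort subj visit
by subj: generate nvis = _N
by subj: generate lastvis = visit[_N]

* Fails with an error if any subject has a gap followed by a later visit
assert nvis == lastvis

* Distribution of the number of visits, one row per subject
by subj: generate byte first = (_n == 1)
tabulate nvis if first

drop nvis lastvis first
```

If the assert passes silently, the tabulation gives the number of subjects with each number of visits, which can be compared with the counts quoted from Display 8.4.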

We will now plot the subjects' individual response profiles over the visits separately for each group using the by() option. To obtain the correct group labels with the by() option we must first label the values of group:

label define treat 0 "Placebo" 1 "Estrogen"
label values group treat

xtdes, i(subj) t(visit)

    subj:  1, 2, ..., 61                                     n =         61
   visit:  1, 2, ..., 6                                      T =          6
           Delta(visit) = 1; (6-1)+1 = 6
           (subj*visit uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       3         6         6       6       6

     Freq.  Percent    Cum. |  Pattern

In each graph we want to connect the points belonging to a given subject, but avoid connecting points of different subjects. A simple way of achieving this is to use the connect(ascending) option. Before plotting, the data need to be sorted by the grouping variable and by the x variable (here visit):

sort group subj visit
twoway connected dep visit, connect(ascending) by(group) ///
    ytitle(Depression) xlabel(1/6)

The connect(ascending) option connects points only so long as visit is ascending. For the first subject (subj=1) this is true; but for the second subject, visit begins at 1 again, so the last point for subject one is not connected with the first point for subject two. The remaining points for this subject are, however, connected and so on. The xlabel() option was used here to make the axis range start at 1 instead of 0. The diagram is shown in Figure 8.2. (Some points for different subjects are connected at visit one; this happens when successive subjects have missing data for all subsequent visits so that visit does not decrease when subj increases.) The individual plots reflect the general decline in the depression scores over time indicated by the means obtained using the summarize command; there is, however, considerable variability. The phenomenon of "tracking" is apparent whereby some individuals have consistently higher values than other individuals, leading to within-subject correlations. Notice that some profiles are not complete because of missing values.

To obtain the boxplots of the depression scores at each visit for each treatment group, the following instruction can be used:

graph box dep, over(visit) over(group, ///
    relabel(1 "Placebo group" 2 "Estrogen group"))

Figure 8.2: Individual response profiles by treatment group.

Here the over() options specify two grouping variables, visit and group, to plot the distributions by visit within groups. The relabel() option is used to define labels for the groups. Here "1" refers to the first level of group (0 in this case) and "2" to the second. The resulting graph is shown in Figure 8.3. Again, the general decline in depression scores in both treatment groups can be seen and, in the active treatment group, there is some evidence of outliers which may need to be examined. (Figure 8.2 shows that four of the outliers are due to one subject whose response profile lies above the others.)

A plot of the mean profiles of each treatment group, which includes information about the standard errors of each mean, can be obtained using the collapse instruction that produces a dataset consisting of selected summary statistics. Here, we need the mean depression score on each visit for each group, the corresponding standard deviations, and a count of the number of observations on which these two statistics are based.

collapse (mean) dep (sd) sddep=dep (count) n=dep, ///
    by(visit group)

list in 1/10, clean

(see Display 8.5). The mean value is now stored in dep; but since more than one summary statistic for the depression scores was required, the remaining statistics were given new names in the collapse instruction.

Figure 8.3: Boxplots for six visits by treatment group (panels: Placebo group, Estrogen group).

           visit      group
      1.       1    Placebo
      2.       1   Estrogen
      3.       2    Placebo
      4.       2   Estrogen
      5.       3    Placebo
      6.       3   Estrogen
      7.       4    Placebo
      8.       4   Estrogen
      9.       5    Placebo
     10.       5   Estrogen

Display 8.5

The required mean and standard error plots can now be produced as follows:

    generate high = dep + 2*sddep/sqrt(n)
    generate low = dep - 2*sddep/sqrt(n)
    twoway (rarea low high visit, bfcolor(gs12) sort)   ///
        (connected dep visit, mcolor(black)             ///
        clcolor(black) sort), by(group)                 ///
        legend(order(1 "95% CI" 2 "mean depression"))

Here twoway rarea produces a shaded area between the lines low versus visit and high versus visit, the 95% confidence limits for the mean. It is important that the line for the mean is plotted after the shaded area because it would otherwise be hidden underneath it. The sort option is used both in the rarea and connected plots to ensure that the areas and lines are drawn for visit in ascending order. The resulting diagram is shown in Figure 8.4.
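To see what the collapse step and the two generate commands produce, the following minimal sketch applies them to a tiny made-up dataset (the eight scores are hypothetical and are not taken from the estrogen patch trial). With two observations per cell the arithmetic can be checked by hand: in the first cell the scores 10 and 14 give a mean of 12 and a standard deviation of about 2.83, so the limits are 12 ± 2 × 2.83/√2, that is, 8 and 16.

```stata
* Hypothetical long-format data: two visits, two groups
clear
input visit group dep
1 0 10
1 0 14
1 1  8
1 1 12
2 0  9
2 0  .
2 1  6
2 1  8
end

* One row per visit-group cell: mean, standard deviation, and
* number of non-missing depression scores
collapse (mean) dep (sd) sddep=dep (count) n=dep, by(visit group)

* Approximate 95% limits for each cell mean: mean +/- 2 standard errors
generate high = dep + 2*sddep/sqrt(n)
generate low  = dep - 2*sddep/sqrt(n)
list, clean
```

A cell with a single non-missing observation, such as visit 2 in group 0 here, has a missing standard deviation, so high and low are missing there as well.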

Figure 8.4: Mean and standard error plots; the shaded areas represent ± 2 standard errors.

Table 8.2 Response features suggested in Matthews et al. (1990)

    Type of data   Property to be compared between groups   Summary measure
    Peaked         overall value of response                mean or area under curve
    Peaked         value of most extreme response           maximum (minimum)
    Peaked         delay in response                        time to maximum or minimum
    Growth         rate of change of response               linear regression coefficient
    Growth         final level of response                  final value or (relative) difference between first and last
    Growth         delay in response                        time to reach a particular value

A relatively straightforward approach to the analysis of longitudinal data is that involving the use of summary measures, sometimes known as response feature analysis. The responses of each subject are used to construct a single number that characterizes some relevant aspect of the subject's response profile. (In some situations more than a single summary measure may be required.) The summary measure needs to be chosen before the analysis of the data. The most commonly used measure is the mean of the responses over time because many investigations, e.g., clinical trials, are most concerned with differences in overall levels rather than more subtle effects. Other possible summary measures are listed in Matthews et al. (1990) and are shown here in Table 8.2.

Having identified a suitable summary measure, the analysis of the data generally involves the application of a simple univariate test (usually a t-test or its nonparametric equivalent) for group differences on the single measure now available for each subject. For the estrogen patch trial data, the mean over time seems an obvious summary measure. The mean of all non-missing values is obtained (after restoring the data) using

    restore
    egen avg = rowmean(dep1 dep2 dep3 dep4 dep5 dep6)
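The key property of rowmean() is that it averages over whatever values are present in each row. A minimal sketch with three hypothetical subjects (made-up scores, only three visits for brevity) shows this:

```stata
* Hypothetical wide-format data; "." marks a missed visit
clear
input subj dep1 dep2 dep3
1 10 8 6
2 12 . .
3  9 7 .
end

* Row mean over the non-missing values only
egen avg = rowmean(dep1 dep2 dep3)
list, clean
```

Subject 2, who attended a single visit, still receives a summary measure (12), and subject 3's mean is based on two visits. This is convenient, but it also means that the summaries are not all equally precise.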

The differences between these means may be tested using a t-test assuming equal variances in the populations:

    ttest avg, by(group)

(see Display 8.6). The assumption of equal variances can be relaxed using the unequal option:

    ttest avg, by(group) unequal

(see Display 8.7). In each case the conclusion is that the mean depression score is substantially lower in the estrogen group than the placebo group. The difference in mean depression scores is estimated as 4.2 with a 95% confidence interval (assuming equal variances) from 1.6 to 6.8.

    Two-sample t test with equal variances

                   Mean   Std. Err.     [95% Conf. Interval]
        diff    4.20399   1.294842      1.613018   6.794964

        diff = mean(0) - mean(1)                          t =   3.2467
    Ho: diff = 0                         degrees of freedom =       59

        Ha: diff < 0            Ha: diff != 0            Ha: diff > 0
     Pr(T < t) = 0.9990    Pr(|T| > |t|) = 0.0019    Pr(T > t) = 0.0010

Display 8.6

    Two-sample t test with unequal variances

       Group    Obs       Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
           0     27   14.75605   .8782852    4.563704    12.95071   16.56139
           1     34   10.55206   .9187872    5.357404    8.682772   12.42135

        diff = mean(0) - mean(1)                          t =   3.3075
    Ho: diff = 0        Satterthwaite's degrees of freedom =  58.6777

        Ha: diff < 0            Ha: diff != 0            Ha: diff > 0
     Pr(T < t) = 0.9992    Pr(|T| > |t|) = 0.0016    Pr(T > t) = 0.0008

Display 8.7
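The non-integer degrees of freedom in Display 8.7 come from Satterthwaite's approximation, which uses only the two group sizes and the standard errors of the two group means. As a rough check, the arithmetic below plugs in the values from Display 8.7 rounded to four decimals, so the results agree with Stata's output only up to rounding.

```latex
\[
\nu \;=\; \frac{\left(\mathrm{se}_0^2 + \mathrm{se}_1^2\right)^2}
               {\dfrac{\mathrm{se}_0^4}{n_0-1} + \dfrac{\mathrm{se}_1^4}{n_1-1}},
\qquad \mathrm{se}_g = \frac{s_g}{\sqrt{n_g}}
\]
% Plugging in se_0 = 0.8783 (n_0 = 27) and se_1 = 0.9188 (n_1 = 34):
\[
\mathrm{se}_0^2 + \mathrm{se}_1^2 \approx 0.7714 + 0.8442 = 1.6156,
\qquad
\nu \approx \frac{1.6156^2}{\dfrac{0.7714^2}{26} + \dfrac{0.8442^2}{33}}
    \approx \frac{2.6102}{0.02289 + 0.02160} \approx 58.7
\]
\[
t \;=\; \frac{14.756 - 10.552}{\sqrt{1.6156}} \approx \frac{4.204}{1.271} \approx 3.31
\]
```

Because the two sample standard deviations are similar here, the approximate degrees of freedom (about 58.7) are close to the 59 used under the equal-variance assumption, which is why the two tests lead to nearly the same p-value.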

We might also be interested in the rate of change (here decline) of the response. An appropriate summary measure is the regression coefficient of depression on visit. This can be obtained as a weighted sum of the depression scores at the six visits. However, for subjects who dropped out, the least squares estimator will not be the same as for subjects with complete data. It is therefore considerably easier to ask Stata to estimate a linear regression model for each subject using the statsby prefix. First we must reshape the data to long form as before:

    reshape long dep, i(subj) j(visit)

Now we can use the statsby command to replace the current dataset by a dataset of summary statistics for each subject:

    statsby slope=_b[visit] inter=_b[_cons] df=e(df_r), ///
        by(group subj) clear: regress dep visit
    list in 1/10

The second line specifies that dep should be regressed on visit for each unique combination of group and subject. The reason for specifying group here is so that this variable appears in the summary measure dataset. The first line specifies which results from the regression command should be stored under which variable name. Here slope will contain the regression coefficient of visit, inter the constant, and df the residual degrees of freedom. The first ten observations of the new dataset are shown in Display 8.8.

           group   subj       slope      inter   df
      1.       0      1   -.5714286         18    4
      2.       0      2   -3.257143   29.06667    4
      3.       0      3          -3         20    0
      4.       0      4   -1.342857   10.06667    4
      5.       0      5   -1.542857   12.73333    4
      6.       0      6      -2.426   18.39267    4
      7.       0      7   -1.342857       14.2    4
      8.       0      8           1         25    0
      9.       0      9   -3.687428   30.06287    4
     10.       0     10    1.571429          8    4

Display 8.8

To compare the mean slopes for subjects who had at least 1 residual degree of freedom, we again use the ttest command

    ttest slope if df>0, by(group)

giving the output shown in Display 8.9. There is no evidence for a difference in the mean rate of decline between the two groups.

    Two-sample t test with equal variances

        diff = mean(0) - mean(1)                          t =  -0.2501
    Ho: diff = 0                         degrees of freedom =       44

        Ha: diff < 0            Ha: diff != 0            Ha: diff > 0
     Pr(T < t) = 0.4018    Pr(|T| > |t|) = 0.8037    Pr(T > t) = 0.5982

Display 8.9

The summary measure approach to longitudinal data has a number of advantages:

■ Appropriate choice of summary measure ensures that the analysis is focused on relevant and interpretable aspects of the data,
■ The method is easy to explain and intuitive, and
■ To some extent, missing and irregularly spaced observations can be accommodated.

However, the method is somewhat ad hoc, particularly in its treatment of missing data. For instance, if the summary measure is a mean, but there is actually a decline in the response over time, then the mean of all available data will overestimate the mean for those who dropped out early (a better summary measure in this case is the intercept from a linear regression model). Furthermore, response feature analysis treats all summaries as equally precise even if some are based on fewer observations due to missing data. In the next two chapters we will therefore discuss more formal approaches to longitudinal data, random effects modeling, and generalized estimating equations.
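The per-subject regression step used for the slope analysis can be sketched on a small hypothetical dataset (three made-up subjects; the third has only two visits). The sketch assumes nothing beyond the statsby and regress syntax used in the text.

```stata
* Hypothetical long-format data: subject 3 dropped out after visit 2
clear
input subj visit dep
1 1 10
1 2  8
1 3  6
2 1 12
2 2 12
2 3  9
3 1  7
3 2  5
end

* Replace the data by one row per subject holding the fitted slope,
* intercept, and residual degrees of freedom of a straight-line fit
statsby slope=_b[visit] inter=_b[_cons] df=e(df_r), ///
    by(subj) clear: regress dep visit
list, clean
```

Subject 3 gets a slope of -2 from a line through just two points, with zero residual degrees of freedom. Rows like this are the ones excluded by the condition if df>0 in the t-test, and they illustrate the point that summaries based on few observations are less precise than the others.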

8.4 Exercises

8.1 Treatment of post-natal depression

1. Produce boxplots corresponding to those shown in Figure 8.3 using the data in wide form.

2. Compare the results of the t-tests given in the text with the corresponding t-tests calculated only for those subjects having observations on all six post-randomization visits.

3. Repeat the summary measures analysis described in the text using the maximum over time instead of the mean (see help egen).

4. Test for differences in the mean over time controlling for the baseline measurement using
   a. a change score defined as the difference between the mean over time and the baseline measurement, and
   b. analysis of covariance of the mean over time using the baseline measurement as a covariate.

See also Exercises in Chapter 9.

8.2 Wage increases

1. For the data described in Exercise 1.2, produce boxplots for the log hourly wage over time by ethnic/racial group.

2. Plot the mean log wage over time by ethnic group, showing the 95% confidence bands as in Figure 8.4.

3. Compare the mean log wages between the three groups using multiple regression with dummy variables.

4. Repeat the analysis above but this time controlling for educ, the number of years of schooling.

5. Interpret the findings.

See also Exercise 9.4.

8.3 Jaw growth

In this jaw growth dataset from Potthoff and Roy (1964), eleven boys and sixteen girls had the distance between the center of the pituitary gland and the pterygomaxillary fissure recorded at ages 8, 10, 12, and 14.

The variables in the dataset growth.dta are:

■ idnr: subject identifier
■ measure: distance between pituitary and maxillary fissure in millimeters
■ age: age in years
■ sex: sex (1=boys, 2=girls)

1. Plot the observed growth trajectories, i.e., plot measure against age, connecting successive observations on the same subject using the connect(ascending) option. Use the by() option to obtain separate graphs by sex.

2. Use the statsby prefix to obtain estimated intercepts and slopes for each subject and compare the means of these summary measures between boys and girls using independent samples t-tests. To make the intercepts meaningful, subtract 8 from age before running the statsby prefix command.

8.4 Treatment of Alzheimer's

The data used here arise from an investigation of the use of lecithin, a precursor of choline, in the treatment of Alzheimer's disease. Traditionally it has been assumed that this condition involves an inevitable and progressive deterioration in all aspects of intellect, self-care, and personality. Recent work suggests that the disease involves pathological changes in the central cholinergic system, which it might be possible to remedy by long-term dietary enrichment with lecithin. In particular, the treatment might slow down or even halt the memory impairment usually associated with the condition. Patients suffering from Alzheimer's disease were randomly allocated to receive either lecithin or placebo for a six-month period. A cognitive test score giving the number of words recalled from a previously studied list was recorded at the start, at one month, at two months, at four months and at six months. (The data are given in Everitt and Pickles, 2004.)

The variables in alzheimer.dta are:

■ group: treatment group (1=placebo, 2=lecithin)
■ v1 to v5: number of words recalled at the start, and each subsequent month

1. In these data the clinicians were specifically interested in the maximum value of the response variable over the five measurement occasions. Generate a variable equal to the maximum measurement for each person.

2. We wish to compare the distribution of the maximum number of words recalled across the five visits between treatment groups. Do the assumptions of an independent samples t-test appear to be satisfied?

3. Use an appropriate test for comparing the treatment groups.

Chapter 9

Random Effects Models: Thought Disorder and Schizophrenia

9.1 Description of data

In this chapter we will analyze data from the Madras Longitudinal Schizophrenia Study in which patients were followed up monthly after their first hospitalization for schizophrenia. The study is described in detail in Thara et al. (1994). Here we use a subset of the data analyzed by Diggle et al. (2002), namely data on thought disorder (1: present, 0: absent) at 0, 2, 6, 8, and 10 months after hospitalization on women only. The thought disorder responses are given as y0 to y10 in Table 9.1 where a "." indicates a missing value. The variable early is a dummy variable for early onset of disease (1: age-of-onset less than 20 years, 0: age-of-onset 20 years or above). An important question here is whether the course of illness differs between patients with early and late onset.

We will also reanalyze the post-natal depression data described in the previous chapter.

9.2 Random effects models

The data listed in Table 9.1 consist of repeated observations on the same subject taken over time and are a further example of a set of longitudinal data. During the last decades, statisticians have considerably

Table 9.1 Data in madras.dta

id early jI) y2 yb y8 y10 1 U I 1 0 0 0 6 1 O U O O U 1 0 0 1 1 0 0 0 1 3 0 U O U O O 1 4 0 1 1 3 1 1 5 0 1 1 0 0 0 I 6 0 1 1 0 0 2 2 0 1 1 n o o 2 3 0 0 1 0 0 25 1 0 0 0 0 0 2 7 1 1 1 1 0 1 2 8 0 0 0 . 3 1 1 1 1 0 ; ; 3 4 0 1 1 0 0 ( I 3 8 1 I 1 0 0 0 4 3 0 1 1 1 o o 4 4 U 0 1 O 0 0 4 5 u 1 1 0 1 u 4 6 0 1 1 1 O U 411 0 0 0 0 0 0 5 0 0 0 0 1 1 1 51 1 U 1 0 0 5 2 ; 2 1 1 1 0 0 $ 3 0 1 0 U O O 5 6 1 1 0 0 0 0 5 7 0 O I J O O O $ 9 0 1 1 0 0 0 6 1 0 a n o o o 6 2 1 1 1 I I O 0 0 5 1 n o o o o 6 3 0 0 U O O O 6 7 0 n ~ o u o 6 8 0 1 1 1 1 1 7 1 0 L l I O O n o 1 o o u o 75 1 1 0 0 . . 7fi U 0 1 . . 77 1 O D O O O 7 9 U 1 . . N O I l O O O 8 5 1 1 1 1 0 0 8 8 O U 1 . . . 87 0 1 0 0 11 % I 0 1 1 0 0 0

enriched the methodology available for the analysis of such data (see Lindsey, 1999, or Diggle et al., 2002) and many of these developments are implemented in Stata.

Longitudinal data require special methods of analysis because the responses at different time points on the same individual may not be independent even after conditioning on the covariates. For a linear regression model this means that the residuals for the same individual are correlated. We can model these residual correlations by partitioning the total residual for subject $i$ at time point $j$ into a subject-specific random intercept or permanent component $u_i$, which is constant over time, plus a residual $\epsilon_{ij}$ which varies randomly over time. The resulting random intercept model can be written as

$$y_{ij} = \boldsymbol{\beta}^T \mathbf{x}_{ij} + u_i + \epsilon_{ij}. \qquad (9.1)$$

The random intercept and residual are each assumed to be independently normally distributed with zero means and constant variances $\tau^2$ and $\sigma^2$, respectively. Furthermore, these random terms are assumed to be independent of each other and of the covariates $\mathbf{x}_{ij}$. (It should be noted that moment-based approaches do not require normality assumptions, see, e.g., Wooldridge (2002), Section 10.4.)

The random intercept model implies that the total residual variance is

$$\mathrm{Var}(u_i + \epsilon_{ij}) = \tau^2 + \sigma^2.$$

Due to this decomposition of the total residual variance into a between-subject component $\tau^2$ and a within-subject component $\sigma^2$, the model is sometimes referred to as a variance components model. The covariance between the total residuals at any two time points $j$ and $j'$ on the same subject $i$ is

$$\mathrm{Cov}(u_i + \epsilon_{ij},\; u_i + \epsilon_{ij'}) = \tau^2.$$

Note that these covariances are induced by the shared random intercept: for subjects with $u_i > 0$, the total residuals will tend to be larger than the mean (0) and for subjects with $u_i < 0$ they will tend to be smaller than the mean.

It follows from the two relations above that the residual correlations are given by

$$\mathrm{Cor}(u_i + \epsilon_{ij},\; u_i + \epsilon_{ij'}) = \frac{\tau^2}{\tau^2 + \sigma^2}.$$

This intraclass correlation can be interpreted as the proportion of the total residual variance (denominator) that is due to residual variability between subjects (numerator).
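The steps behind these three relations can be spelled out using only the stated assumptions (the random intercept and the residuals are mutually independent, and residuals at different time points are independent). The numerical values in the last line are hypothetical and serve only to illustrate the ratio.

```latex
\begin{align*}
\mathrm{Var}(u_i + \epsilon_{ij})
  &= \mathrm{Var}(u_i) + \mathrm{Var}(\epsilon_{ij}) + 2\,\mathrm{Cov}(u_i, \epsilon_{ij})
   = \tau^2 + \sigma^2 + 0, \\[4pt]
\mathrm{Cov}(u_i + \epsilon_{ij},\, u_i + \epsilon_{ij'})
  &= \mathrm{Var}(u_i) + \mathrm{Cov}(u_i, \epsilon_{ij'})
     + \mathrm{Cov}(\epsilon_{ij}, u_i) + \mathrm{Cov}(\epsilon_{ij}, \epsilon_{ij'})
   = \tau^2 + 0 + 0 + 0, \qquad j \neq j', \\[4pt]
\mathrm{Cor}(u_i + \epsilon_{ij},\, u_i + \epsilon_{ij'})
  &= \frac{\tau^2}{\sqrt{(\tau^2 + \sigma^2)(\tau^2 + \sigma^2)}}
   = \frac{\tau^2}{\tau^2 + \sigma^2}. \\[4pt]
\text{Hypothetical example: } \tau = 3,\ \sigma = 4
  &\;\Longrightarrow\; \frac{9}{9 + 16} = 0.36 .
\end{align*}
```

The correlation is the same for every pair of time points, however far apart they are, which is a strong assumption for longitudinal data and one motivation for the random coefficient models that follow.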

The random intercept can be interpreted as the combined effect of all unobserved subject-specific covariates, often referred to as unobserved heterogeneity. The random intercepts represent individual differences in the overall mean level of the response after controlling for covariates. Random coefficients of covariates can be used to allow for between-subject heterogeneity in the effects of the covariates. For instance, in longitudinal data, the shape of the response profile may vary between subjects in addition to variability in its vertical position. If the overall shape is linear in time $t_{ij}$, subjects may differ randomly in their slopes giving a model of the form

$$y_{ij} = \boldsymbol{\beta}^T \mathbf{x}_{ij} + u_{0i} + u_{1i} t_{ij} + \epsilon_{ij}. \qquad (9.2)$$

Here $u_{0i}$ is a random intercept and $u_{1i}$ a random coefficient or slope of $t_{ij}$. These random effects are assumed to have a bivariate normal distribution with zero means, variances $\tau_0^2$ and $\tau_1^2$, and covariance $\tau_{01}$.

They are furthermore assumed to be uncorrelated across subjects and uncorrelated with $\epsilon_{ij}$ or any of the covariates. If the covariate vector $\mathbf{x}_{ij}$ includes $t_{ij}$, the corresponding fixed coefficient $\beta_t$ represents the mean coefficient of time whereas the random slope $u_{1i}$ represents the deviation from the mean coefficient for subject $i$. (Not including $t_{ij}$ in the fixed part of the model would imply that the mean slope is zero.) The model can also include nonlinear functions of $t_{ij}$, typically powers of $t_{ij}$, whose coefficients may be fixed or random.

The total residual in (9.2) is $u_{0i} + u_{1i} t_{ij} + \epsilon_{ij}$ with variance

$$\mathrm{Var}(u_{0i} + u_{1i} t_{ij} + \epsilon_{ij}) = \tau_0^2 + 2\tau_{01} t_{ij} + \tau_1^2 t_{ij}^2 + \sigma^2,$$

which is no longer constant over time but heteroskedastic. Similarly, the covariance between two total residuals of the same subject,

$$\mathrm{Cov}(u_{0i} + u_{1i} t_{ij} + \epsilon_{ij},\; u_{0i} + u_{1i} t_{ij'} + \epsilon_{ij'}) = \tau_0^2 + \tau_{01}(t_{ij} + t_{ij'}) + \tau_1^2 t_{ij} t_{ij'},$$

is not constant over time. It should also be noted that both the random intercept variance and the correlation between the random coefficient and random intercept depend on the location of $t_{ij}$, i.e., re-estimating the model after adding a constant to $t_{ij}$ will lead to different estimates.
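The dependence on the location of the time variable can be made explicit with a short derivation. Here $c$ is an arbitrary constant added to the time variable and the starred symbols (introduced only for this illustration) denote quantities in the shifted parameterization.

```latex
\begin{align*}
t^*_{ij} &= t_{ij} + c
  \quad\Longrightarrow\quad
  u_{0i} + u_{1i} t_{ij} = \underbrace{(u_{0i} - c\,u_{1i})}_{u^*_{0i}} + u_{1i} t^*_{ij}, \\[4pt]
\mathrm{Var}(u^*_{0i}) &= \tau_0^2 - 2c\,\tau_{01} + c^2 \tau_1^2, \\[4pt]
\mathrm{Cov}(u^*_{0i}, u_{1i}) &= \tau_{01} - c\,\tau_1^2, \\[4pt]
\mathrm{Var}(u_{1i}) &= \tau_1^2 \quad \text{(unchanged)} .
\end{align*}
```

The random intercept is the subject's deviation at time zero, so moving the origin changes what it measures. The slope variance and the implied variances and covariances of the total residuals at the actual observation times are unaffected; only the intercept variance and the intercept-slope covariance (and hence correlation) are re-expressed.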

General terms for random intercept or random coefficient models are random effects models, mixed effects or mixed models, where "mixed" refers to the presence of both fixed effects $\boldsymbol{\beta}$ and random effects $u_{0i}$ and $u_{1i}$. The models are also hierarchical or multilevel since the elementary observations at the individual time points (level 1) are nested in subjects (level 2). The models discussed in this chapter are appropriate for any kind of clustered or two-level data, not just longitudinal. Other examples of two-level data are people in families, households, neighborhoods, cities, schools, hospitals, firms, etc. In all these types of data, we can generally not assume that responses for subjects in the same cluster are independent after controlling for covariates because there is likely to be unobserved heterogeneity between clusters.

For non-normal responses (for example, binary responses) we can extend the generalized linear model discussed in Chapter 7 by introducing a random intercept $u_i$ into the linear predictor:

$$\eta_{ij} = \boldsymbol{\beta}^T \mathbf{x}_{ij} + u_i,$$

where the $u_i$ are independently normally distributed with mean zero and variance $\tau^2$. (We have encountered a similar model in Chapter 7, namely the negative binomial model with a log link and Poisson distribution where $u_i$ has a log-gamma distribution, and there is only one observation per subject.) We can further extend the random intercept model to include random coefficients as we did in the previous section.

Unfortunately, such generalized linear mixed models are difficult to estimate. This is because the likelihood involves integrals over the random effects distribution and these integrals generally do not have closed forms. Stata uses numerical integration by adaptive Gauss-Hermite quadrature for random intercept models. A user-written program gllamm can be used to estimate random coefficient models by adaptive quadrature. The program can also be used to estimate multilevel models with more than two levels of nesting (Rabe-Hesketh et al., 2005). Note that approximate methods such as penalized quasilikelihood (e.g., Breslow and Clayton, 1993) and its refinements do not tend to work well for data with dichotomous responses and small cluster sizes such as the thought disorder data (see also Rabe-Hesketh et al., 2002).
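To make the source of the difficulty concrete, the integral in question can be written out for the random intercept model. The sketch below shows the ordinary (non-adaptive) Gauss-Hermite rule only, as an illustration of the idea; the notation $f(\cdot)$, $a_r$, $w_r$ and $R$ is introduced here and is not taken from the text, and the adaptive version used in practice additionally shifts and rescales the nodes for each subject.

```latex
% Likelihood contribution of subject i: the random intercept is integrated out
\[
L_i(\boldsymbol{\beta}, \tau)
  \;=\; \int_{-\infty}^{\infty}
        \Bigl[\,\prod_{j} f\!\left(y_{ij} \mid \mathbf{x}_{ij}, u_i\right)\Bigr]
        \frac{1}{\tau\sqrt{2\pi}} \exp\!\Bigl(-\frac{u_i^2}{2\tau^2}\Bigr)\, du_i
\]
% For a binary response with logit link, with eta_ij = beta' x_ij + u_i:
\[
f\!\left(y_{ij} \mid \mathbf{x}_{ij}, u_i\right)
  \;=\; \frac{\exp(y_{ij}\,\eta_{ij})}{1 + \exp(\eta_{ij})}
\]
% Ordinary Gauss-Hermite rule with R nodes a_r and weights w_r
% (nodes and weights for the weight function exp(-a^2)):
\[
L_i(\boldsymbol{\beta}, \tau)
  \;\approx\; \frac{1}{\sqrt{\pi}} \sum_{r=1}^{R} w_r
        \prod_{j} f\!\left(y_{ij} \mid \mathbf{x}_{ij},\, u_i = \sqrt{2}\,\tau\, a_r\right)
\]
```

The factor $\sqrt{2}\,\tau$ comes from the change of variable $u_i = \sqrt{2}\,\tau a$, which turns the normal density into the weight function $e^{-a^2}/\sqrt{\pi}$. Increasing $R$ improves the approximation at the cost of more evaluations of the product for every subject.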

An important problem with many longitudinal data sets is the occurrence of dropouts, e.g., subjects failing to complete all scheduled visits in the post-natal depression data. A taxonomy of dropouts is given in Diggle et al. (2002). Fortunately, maximum likelihood estimation is consistent as long as the data are missing at random (MAR), that is, the probability of missingness does not depend on the values that are missing. For example, if the model is correctly specified, we obtain consistent parameter estimates even if the probability of dropping out depends on the responses at earlier time points.

Useful books on random effects modeling include Snijders and Bosker (1999), Verbeke and Molenberghs (2000), Goldstein (2003), and Skrondal and Rabe-Hesketh (2004), as well as general books on longitudinal data such as Lindsey (1999). Rabe-Hesketh and Skrondal (2005) is a book on "Multilevel and Longitudinal Modeling Using Stata".

9.3 Analysis using Stata

9.3.1 Post-natal depression data

As an example of continuous responses, we first consider the post-natal depression data analyzed in the previous chapter. The data are read using

    infile subj group pre dep1 dep2 dep3 dep4 dep5 dep6 ///
        using depress.dat, clear

All responses must be stacked in a single variable, including the baseline score pre. This is achieved by first renaming pre to dep0 and then using the reshape command:

    rename pre dep0
    reshape long dep, i(subj) j(visit)
    label define treat 0 "Placebo" 1 "Estrogen"
    label values group treat
    mvdecode _all, mv(-9)
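The effect of these data management steps can be checked on a tiny hypothetical dataset (two made-up subjects and only two follow-up visits, with -9 as the missing-value code as in the text):

```stata
* Hypothetical wide-format data; -9 codes a missing score
clear
input subj group pre dep1 dep2
1 0 18 17 -9
2 1 25 20 15
end

* Treat the baseline as visit 0 so it is stacked with the other scores
rename pre dep0
reshape long dep, i(subj) j(visit)

* Recode -9 to Stata's missing value in all variables
mvdecode _all, mv(-9)
list, clean
```

The result has one row per subject and visit (six rows here), with visit taking the values 0, 1 and 2 picked up from the variable name suffixes, and the -9 for subject 1 at visit 2 converted to a missing value.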

We now estimate a random intercept model using xtreg. (Note that commands for longitudinal data have the prefix xt in Stata which stands for cross-sectional time series.) First assume that the mean depression score declines linearly from baseline with different slopes in the two groups:

    generate gr_vis = group*visit
    xtreg dep group visit gr_vis, i(subj) mle

The syntax is the same as for regress except that the i() option is used to specify the cluster identifier, here subj, and the mle option to obtain maximum likelihood estimates.

    Random-effects ML regression                Number of obs      =       356
    Group variable (i): subj                    Number of groups   =        61

    Random effects u_i ~ Gaussian               Obs per group: min =         2
                                                               avg =       5.8
                                                               max =         7

                                                LR chi2(3)         =    225.74
    Log likelihood  = -1045.7117                Prob > chi2        =    0.0000

         dep       Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
       group   -1.644653   1.163462    -1.41    0.157   -3.924996    .6356901
       visit   -1.531905   .1736977    -8.82    0.000   -1.872346   -1.191464
      gr_vis   -.5564469   .2220225    -2.51    0.012   -.9916031   -.1212908
       _cons    19.29632   .8717659    22.13    0.000    17.58769    21.00495

    Likelihood-ratio test of sigma_u=0: chibar2(01)=  114.03 Prob>=chibar2 = 0.000

Display 9.1

The estimates of the "fixed" regression coefficients that do not vary over individuals are given in the first part of the table in Display 9.1, whereas the estimates of the standard deviations $\tau$ of the random intercept and $\sigma$ of the residuals are given under /sigma_u and /sigma_e in the second part. The intraclass correlation rho is estimated as 0.45, implying that 45% of the residual variance is between subjects and 55% within subjects. There is a significant interaction between group and visit at the 5% level; the mean decrease in depression score is estimated as 1.53 per visit in the placebo group and 1.53 + 0.56 per visit in the estrogen group. We can obtain the estimated slope of time in the estrogen group with its p-value and confidence interval using lincom:

    lincom visit + gr_vis

(see Display 9.2).

         dep       Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
         (1)   -2.088352   .1384264   -15.09    0.000   -2.359663   -1.817041

Display 9.2
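The same lincom pattern works after any estimation command with a group-by-covariate interaction. The sketch below uses the auto dataset shipped with Stata as a stand-in for the depression data (the variable for_wt and the scalar est are names made up for this illustration): the slope of weight for foreign cars is the sum of the main-effect coefficient and the interaction coefficient.

```stata
* Stand-in data: price regressed on weight, with a different
* weight slope for foreign cars
sysuse auto, clear
generate for_wt = foreign*weight
regress price foreign weight for_wt

* Slope of weight in the foreign group, with standard error,
* test, and confidence interval
lincom weight + for_wt
scalar est = r(estimate)
```

The point estimate is simply the sum of the two coefficients, as in -1.53 + (-0.56) = -2.09 above. What lincom adds is the standard error, which requires the covariance between the two coefficient estimates and cannot be read off the regression table.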

The model assumes that the effect of visit is linear. However, it may well be that the depression score gradually levels off, remaining stable after some period of time. We can investigate this by adding a quadratic term of visit:

    generate vis2 = visit^2
    xtreg dep group visit gr_vis vis2, i(subj) mle

The p-value for vis2 in Display 9.3 suggests that the average curve is not linear. To picture the mean curve, we now plot it together with the observed individual response profiles:

    predict pred0, xb
    sort subj visit
    twoway (line pred0 visit, conn(ascending) lwidth(thick)) ///
        (line dep visit, conn(ascending) lpatt(dash)),       ///
        by(group) ytitle(Depression)                         ///
        legend(order(1 "Fitted mean" 2 "Observed scores"))

giving the graph shown in Figure 9.1 which suggests that the response curves tend to level off.

    Random-effects ML regression
    Group variable (i): subj

    Random effects u_i ~ Gaussian               Obs per group: min =         2
                                                               avg =       5.8
                                                               max =         7

                                                LR chi2(4)         =    268.39
    Log likelihood  = -1024.3838                Prob > chi2        =    0.0000

         dep       Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]

Display 9.3

Figure 9.1: Response profiles and fitted mean curves by treatment group.

Extending the model to include a random coefficient of visit requires the xtmixed command that was introduced in Stata release 9. First we re-estimate the random intercept model using xtmixed:

    xtmixed dep group visit gr_vis vis2 || subj:, mle

Here the fixed part of the model is specified as in all estimation commands, and the random part is specified after the double-bar ||. First, the cluster identifier is given, followed by a colon. Then all explanatory variables that should have random coefficients varying between clusters are listed. A random intercept is automatically included unless the nocons option is used. Here we only require a random intercept, so no variables are listed after subj:. After the comma, we use the mle option to specify maximum likelihood estimation (the default is restricted maximum likelihood estimation).
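The random-part syntax can be tried out on simulated data. The sketch below is only an illustration under stated assumptions: the 50 subjects, the variable names id, t, y, u0 and u1, and all parameter values are made up, and the random numbers are generated with invnormal(uniform()). It fits a random intercept model and then adds a random slope by listing the variable after the colon.

```stata
* Simulate 50 hypothetical subjects with 5 occasions each
clear
set seed 12345
set obs 50
generate id = _n
generate u0 = 2*invnormal(uniform())      // subject-specific intercept deviation
generate u1 = 0.5*invnormal(uniform())    // subject-specific slope deviation
expand 5
bysort id: generate t = _n - 1
generate y = 10 - t + u0 + u1*t + invnormal(uniform())

* Random intercept only: nothing is listed after the colon
xtmixed y t || id:, mle

* Random intercept and random slope of t, with their covariance estimated
xtmixed y t || id: t, covariance(unstructured) mle
```

Without covariance(unstructured) the random intercept and slope would be assumed uncorrelated, which is rarely sensible given that their covariance depends on where time zero is placed. In Stata releases from 13 onward the same models are fitted with the command mixed, which uses maximum likelihood by default.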

The output shown in Display 9.4 perfectly matches that from xtreg with sd(_cons) corresponding to /sigma_u and sd(Residual) to /sigma_e.

    Mixed-effects ML regression                 Number of obs      =       356
    Group variable: subj                        Number of groups   =        61

                                                Obs per group: min =         2
                                                               avg =       5.8
                                                               max =         7

                                                Wald chi2(4)       =    409.15
    Log likelihood = -1024.3838                 Prob > chi2        =    0.0000

         dep       Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
       group   -1.471391   1.139398    -1.29    0.197   -3.704567    .7617841
       visit   -3.787308   .3710723   -10.21    0.000   -4.514596   -3.060019
      gr_vis   -.5848499   .2073853    -2.82    0.005   -.9913176   -.1783821
        vis2    .3851916   .0569327     6.77    0.000    .2736056    .4967775
       _cons    20.91077   .8860139    23.60    0.000    19.17421    22.64733

      Random-effects Parameters     Estimate   Std. Err.   [95% Conf. Interval]
      subj: Identity
                      sd(_cons)     3.583866   .3868167    2.900527   4.428191
                   sd(Residual)     3.678709   .1508831    3.394558   3.986646

    LR test vs. linear regression: chibar2(01) =   133.51 Prob >= chibar2 = 0.0000

Display 9.4

The most common method of predicting random effects is by their posterior means, their expectations given the observed responses and covariates with the parameter estimates plugged in. These predictions are also known as empirical Bayes predictions, shrinkage estimates or, in linear mixed models, best linear unbiased predictions (BLUP). Adding predictions of the random intercept to the predicted mean response profile gives individual predicted response profiles. These can be obtained after estimation with xtmixed using the predict command with the fitted option:

    predict pred1, fitted
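For the random intercept model the posterior mean has a simple closed form, which explains the name "shrinkage estimate". The expression below is the standard result for the linear random intercept model; the symbols $\tilde{u}_i$, $\hat{R}_i$ and $n_i$ (the number of observations for subject $i$) are notation introduced here for illustration.

```latex
\[
\tilde{u}_i \;=\; \hat{R}_i \,\cdot\, \frac{1}{n_i} \sum_{j=1}^{n_i}
      \left( y_{ij} - \hat{\boldsymbol{\beta}}^T \mathbf{x}_{ij} \right),
\qquad
\hat{R}_i \;=\; \frac{\hat{\tau}^2}{\hat{\tau}^2 + \hat{\sigma}^2 / n_i}
\]
% Individual predicted profile (what the fitted option returns):
\[
\hat{y}_{ij} \;=\; \hat{\boldsymbol{\beta}}^T \mathbf{x}_{ij} + \tilde{u}_i
\]
```

The prediction is the subject's mean total residual multiplied by a factor between 0 and 1. The factor is closer to 1 when the subject contributes many observations or when the between-subject variance is large relative to the residual variance, and closer to 0 for subjects who dropped out early, whose predictions are therefore pulled more strongly towards the population mean curve.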

A graph of the individual predicted profiles is obtained using

    twoway (line pred1 visit, connect(ascending)), by(group) ///
        ytitle(Depression) xlabel(0/6)

and given in Figure 9.2. It is clear that the mean profiles in Figure 9.1 have simply been shifted up and down to fit the observed individual profiles more closely.

Figure 9.2: Predicted response curves for random intercept model.

We can also obtain empirical Bayes predictions of the random intercepts themselves using the reffects option

    predict inter, reffects

Unfortunately, xtmixed does not produce standard errors of the predictions at the time of writing this book. To obtain these, we will use a user-contributed program gllamm (for generalized linear latent and mixed models) described in Rabe-Hesketh et al. (2002), (2004b) and Rabe-Hesketh and Skrondal (2005) (see also www.gllamm.org). The program can be obtained from the SSC archive using

    ssc install gllamm

We will first re-estimate the random intercept model using gllamm:

    gllamm dep group visit gr_vis vis2, i(subj) adapt

The syntax is as for xtreg except that the mle option is not required since gllamm always uses maximum likelihood, and we have specified adapt to use adaptive quadrature. The estimates are shown in Display 9.5.

    number of level 1 units = 356
    number of level 2 units = 61

    Condition Number = 88.719052

    gllamm model

    log likelihood = -1024.3838

         dep       Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
       group   -1.471391     1.1394    -1.29    0.197   -3.704574    .7617913
       visit   -3.787248   .3710407   -10.21    0.000   -4.514632   -3.059989
      gr_vis   -.5848498   .2073976    -2.82    0.005   -.9913417    -.178358
        vis2    .3851916   .0569339     6.77    0.000    .2736033      .49678
       _cons    20.91077   .8860218    23.60    0.000     19.1742    22.64734

    Variance at level 1

      13.532897 (1.1101096)

    Variances and covariances of random effects

    ***level 2 (subj)

        var(1): 12.849537 (2.7739368)

Display 9.5

The format of the output for the random part is somewhat different from that of xtreg and xtmixed. "Variance at level 1" refers to the residual variance $\sigma^2$, whereas var(1) under "Variances and covariances of random effects" refers to the random intercept variance $\tau^2$, with standard errors given in parentheses. (These standard errors are not very useful and neither are the standard errors for the standard deviations reported by xtreg or xtmixed, since the sampling distributions of the estimates are unlikely to be well approximated by a normal distribution.)

All estimates from gllamm are nearly identical to those using xtreg or xtmixed. This will not always be the case since gllamm uses numerical integration for all models, whereas xtreg and xtmixed exploit the availability of a closed form likelihood for linear mixed models. (Note that we would not generally recommend using gllamm for linear mixed models but do so here to obtain standard errors for the predicted random effects.) In gllamm the accuracy of the estimates can be improved by increasing the number of quadrature points for numerical integration from its default of 8 using the nip() option.

We can obtain the empirical Bayes predictions of $u_i$ with standard errors using gllamm's prediction command gllapred with the u option

gllapred rand, u
(means and standard deviations stored in randm1 rands1)

which creates two variables, randm1 for the predictions and rands1 for the standard errors. The standard errors are posterior standard deviations, which are equal to the prediction error standard deviations for linear models. In the multilevel literature, these standard errors are also known as "comparative standard errors". First we make sure that the empirical Bayes predictions are close to those previously produced by xtmixed and stored in the variable inter:

assert abs(randm1-inter)<1e-3

The predictions are equal to at least three decimal places. A graph of the predictions with their approximate 95% confidence intervals for the placebo group is then obtained using

generate f = visit==0
sort randm1
generate rank = sum(f)
serrbar randm1 rands1 rank if visit==0&group==0, scale(2) ///
    xtitle(Rank) ytitle("Random intercept")

with the result shown in Figure 9.3. In linear mixed models the predicted random effects should be normally distributed, so we can use graphs to assess the assumption that the "true" random effects are normally distributed. One possibility is a kernel density plot with a normal density superimposed:

kdensity randm1 if visit==0, epanechnikov normal ///
    xtitle("Predicted random intercept")

In Figure 9.4, the normal density appears to approximate the empirical density well. Note that this diagnostic cannot be used for generalized linear mixed models, where the predicted random effects are generally non-normal even under correct specification.

Figure 9.3: Predicted random intercepts and approximate 95% confidence intervals for the placebo group (based on the prediction error standard deviations).

We will now allow the coefficient of visit to vary randomly between subjects by including a random slope in the model. This can be done using the xtmixed command. (We recommend using xtmixed instead of gllamm here because gllamm is slower and sometimes less accurate than xtmixed for linear mixed models.) Now we list a single variable, visit, in the random part after subj: to request a random coefficient for this variable in addition to a random intercept. By default, xtmixed specifies all random effects as mutually independent. We therefore specify the covariance(unstructured) option, abbreviated cov(unstr), to freely estimate the correlation between intercept and slope.

xtmixed dep group visit gr_vis vis2 || subj: visit, ///
    cov(unstr) mle

In Display 9.6 we see that the output under "Random-effects Parameters" has become more complex. The random intercept standard deviation has been estimated as 3.11, the random slope standard deviation as 0.51, and the correlation between intercepts and slopes as 0.09. The within-subject residual standard deviation (around the subject-specific regression lines) has been estimated as 3.46.

The log likelihood of this model is -1017.27 compared with -1024.38 for the random intercept model. A conventional likelihood ratio test would compare twice the difference in log likelihoods with a chi-squared distribution with two degrees of freedom (for an extra variance and covariance parameter). However, the null hypothesis that the slope has zero variance lies on the boundary of the parameter space (since a variance cannot be negative), and this test is therefore not valid. Snijders and Bosker (1999) suggest dividing the p-value of the conventional likelihood ratio test by two, giving a highly significant result here.

The empirical Bayes predictions for the random coefficient model can be obtained and plotted using

predict u*, reffects

twoway scatter u1 u2 if visit==0, xtitle("Intercept") ///
    ytitle("Slope")

giving the graph in Figure 9.5. We could again assess the normality of the random effects graphically.
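Returning to the comparison of the random intercept and random coefficient models, the likelihood ratio calculation with the halved p-value can be checked by hand. The following is a minimal sketch that uses only the two log likelihoods quoted in the text; the scalar names are illustrative and are not part of the original analysis.

```stata
* Log likelihoods quoted in the text
scalar ll_ri = -1024.3838      // random intercept model
scalar ll_rc = -1017.2722      // random coefficient model

* Conventional likelihood ratio statistic: twice the difference in log likelihoods
scalar lr = 2*(ll_rc - ll_ri)

* Naive p-value from a chi-squared distribution with 2 degrees of freedom
scalar p_naive = chi2tail(2, lr)

* Halved p-value, as suggested by Snijders and Bosker (1999)
scalar p_half = p_naive/2

display "LR statistic   = " %6.2f lr
display "naive p-value  = " %8.5f p_naive
display "halved p-value = " %8.5f p_half
```

Because the calculation needs only the two log likelihoods, it can be repeated for any pair of nested models that differ by one random slope and its covariance with the intercept.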

The predicted profiles can be computed and plotted using

predict pred2, fitted
sort subj visit
twoway (line pred2 visit, connect(ascending)), by(group) ///
    ytitle(Depression) xlabel(0/6)

resulting in Figure 9.6, where the curves are now no longer parallel due to the random slopes.

Mixed-effects ML regression                     Number of obs      =       356
Group variable: subj                            Number of groups   =        61

                                                Obs per group: min =         2
                                                               avg =       5.8
                                                               max =         7

                                                Wald chi2(4)       =    267.33
Log likelihood = -1017.2722                     Prob > chi2        =    0.0000

         dep |     Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+---------------------------------------------------------------
       group | -1.471561   1.021316    -1.44   0.150    -3.473301    .5301793
       visit | -3.779156   .3749743   -10.08   0.000    -4.514093    -3.04422
      gr_vis | -.5870938   .2681352    -2.19   0.029    -1.112629   -.0615582
        vis2 |  .3889214   .0536412     7.25   0.000     .2837866    .4940562
       _cons |  20.90318   .7971391    26.22   0.000     19.34081    22.46555

LR test vs. linear regression:      chi2(3) =   147.73   Prob > chi2 = 0.0000
Note: LR test is conservative and provided only for reference

Display 9.6

Figure 9.6: Predicted response profiles for random coefficient model.

Finally, we can assess the fit of the model by plotting both observed and predicted profiles in a trellis graph containing a separate scatterplot for each subject. For the placebo group the command is

twoway (line pred2 visit) (connected dep visit, lpat(dash)) ///
    if group==0, by(subj, style(compact)) ytitle(Depression) ///
    legend(order(1 "Fitted" 2 "Observed"))

and similarly for the treatment group. The resulting graphs are shown in Figure 9.7. The model appears to represent the data reasonably well.

9.4 Thought disorder data
The thought disorder data are read in using

use madras, clear

Next we stack the dichotomous responses y0 to y10 into a single variable y, and create a new variable month taking on values 0 to 10 using

reshape long y, i(id) j(month)

We wish to investigate how the risk of having thought disorder evolves over time and whether there are differences between early and late onset patients. An obvious first model to estimate is a logistic random intercept model with fixed effects of month, early and their interaction. This can be done using Stata's xtlogit command:

generate month_early = month*early
xtlogit y month early month_early, i(id) or

The output is shown in Display 9.7. The or option was used to obtain odds ratios in the first part of the table. These suggest that there is a decrease in the odds of having thought disorder over time. However, patients with early onset schizophrenia do not differ significantly from late onset patients in their odds of thought disorder at the time of hospitalization (OR=1.05), nor do their odds change at a significantly different rate over time (OR=0.94). The log of the random intercept standard deviation is estimated as 1.02 and the standard deviation itself as 1.63. Here rho is the estimated intraclass correlation for the latent responses; see the latent response formulation of the ordinal logit model in Chapter 6.
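The intraclass correlation for the latent responses can be computed by hand from the random intercept standard deviation. In the latent response formulation of a logit model the level-1 residual has a standard logistic distribution with variance $\pi^2/3$, so that $\rho = \sigma_u^2/(\sigma_u^2 + \pi^2/3)$. The sketch below uses the rounded standard deviation quoted in the text; the scalar names are illustrative.

```stata
* Random intercept standard deviation as quoted in the text (rounded)
scalar sigma_u = 1.63

* Level-1 variance of the latent response under a logit link
scalar var_e = _pi^2/3

* Intraclass correlation of the latent responses
scalar rho = sigma_u^2/(sigma_u^2 + var_e)

display "rho = " %5.3f rho
```

Because the input is rounded to two decimal places, the result agrees with the value reported by xtlogit only approximately.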

Figure 9.7: Observed and predicted response profiles for random coefficient model.

Random-effects logistic regression              Number of obs      =       244
Group variable (i): id                          Number of groups   =        44

Random effects u_i ~ Gaussian                   Obs per group: min =         1
                                                               avg =       5.5
                                                               max =         6

                                                Wald chi2(3)       =     37.??
Log likelihood  = -124.77883                    Prob > chi2        =    0.0000

           y |        OR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+---------------------------------------------------------------
       month |  .6695515   .0507932    -5.29   0.000      .577056    .7768982
       early |  1.047054   .9092773     0.05   0.958     .1908854     5.74335
 month_early |  .9358536   .1302074    -0.48   0.634     .7124893    1.229242
-------------+---------------------------------------------------------------
     sigma_u |  1.632157   .3439486                      1.079406    2.466822
         rho |  .4474342   .1042017                        .26171    .6490836

Likelihood-ratio test of rho=0: chibar2(01) = 26.04  Prob >= chibar2 = 0.000

Display 9.7

The same model can be estimated in gllamm, which allows posterior means and other predictions to be computed using gllapred:

gllamm y month early month_early, i(id) link(logit) ///
    family(binom) adapt eform

estimates store mod1

Here we used syntax very similar to that of glm with the link(), family(), and eform options and stored the estimates for later using estimates store. In Display 9.8 we can see that the estimates are quite close to those using xtlogit. However, there are some small discrepancies because the two programs use different algorithms and have different defaults for the number of quadrature points (12 in xtlogit and 8 in gllamm). Increasing the number of quadrature points for gllamm to 12 using the nip(12) option gives virtually the same estimates as for 8 points, suggesting that 8 points are sufficient. Increasing the number of quadrature points for xtlogit to 20 using the intpoints(20) option gives estimates that are closer to the gllamm estimates above.

number of level 1 units = 244
number of level 2 units = 44

Variances and covariances of random effects
***level 2 (id)

Display 9.8

We now include random slopes of month in the model. When there are several random effects (here intercept and slope), we have to define an equation for each of them to specify the variable multiplying the random effect. The random intercept $u_{0i}$ in equation (9.2) is not multiplied by anything, or equivalently it is multiplied by 1, whereas the random coefficient $u_{1i}$ is multiplied by $t_{ij}$, the variable month. We therefore define the equations as follows:

generate cons = 1
eq inter: cons
eq slope: month

The syntax for the equations is simply eq label: varlist, where label is an arbitrary equation name. We can now run gllamm with two extra options, nrf(2) to specify that there are two random effects and eqs(inter slope) to define the variables multiplying the random effects:

gllamm y month early month_early, i(id) nrf(2) ///
    eqs(inter slope) link(logit) family(binom) ///
    adapt eform

estimates store mod2

giving the output in Display 9.9. The log likelihood has increased by about 3.5, suggesting that the random slope is needed. The fixed effects estimates are very similar to those for the random intercept model. The estimated variances of the intercept and slope are denoted var(1) and var(2) respectively, and their covariance cov(2,1). (The first random effect is the random intercept since inter was the first equation in the eqs() option.) We see that the estimated correlation between the random intercepts and slopes at the time of hospitalization is -0.71. Note that both the random intercept variance and the covariance and correlation refer to the situation when month is zero and change if we translate month by adding or subtracting a constant.
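The dependence of the intercept variance and the intercept-slope correlation on the origin of month can be illustrated with a short calculation. If month is replaced by month minus a constant $c$, the new random intercept is $u_{0i} + c\,u_{1i}$, with variance $\mathrm{var}(u_{0i}) + 2c\,\mathrm{cov}(u_{0i},u_{1i}) + c^2\mathrm{var}(u_{1i})$ and covariance $\mathrm{cov}(u_{0i},u_{1i}) + c\,\mathrm{var}(u_{1i})$ with the slope. The variance components below are hypothetical values chosen only to give a correlation of -0.71 at month zero; they are not the estimates in Display 9.9.

```stata
* Hypothetical variance components (NOT the estimates in Display 9.9)
scalar v1  = 4        // variance of the random intercept at month 0
scalar v2  = 0.25     // variance of the random slope
scalar c21 = -0.71    // covariance of intercept and slope at month 0

display "correlation at month 0: " %5.2f c21/sqrt(v1*v2)

* Move the origin of month to c = 5: the intercept becomes u0 + c*u1
scalar c    = 5
scalar v1c  = v1 + 2*c*c21 + c^2*v2
scalar c21c = c21 + c*v2

display "intercept variance at month 5: " %5.2f v1c
display "correlation at month 5:        " %5.2f c21c/sqrt(v1c*v2)
```

With these assumed values the correlation changes sign when the origin is moved, which shows why the reported correlation should always be read together with the coding of the time variable.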

number of level 1 units = 244
number of level 2 units = 44

gllamm model

log likelihood = -121.19978

           y |    exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+---------------------------------------------------------------
       month |  .6031122   .0769518    -3.96   0.000     .4696698    .7744883
       early |  1.0391??   1.252062     0.03   0.975     .0?78712    11.02232
 month_early |  .9377642   .1937734    -0.31   0.758     .6254614    1.406975

Variances and covariances of random effects
***level 2 (id)

Display 9.9

We now produce some graphs of the model predictions, considering first the random intercept model. For the post-natal depression data the mean profile in Figure 9.1 was simply equal to the fixed part of the random intercept model since the mean of the random intercept is zero. In the logistic model, things are more complicated because the probability of thought disorder given the random intercept (the subject-specific probability) is a nonlinear function of the random intercept:

$$\Pr(y_{ij}=1 \mid u_i) = \frac{\exp(\boldsymbol{\beta}'\mathbf{x}_{ij} + u_i)}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_{ij} + u_i)}. \tag{9.4}$$

The population averaged probability is therefore not equal to the above with $u_i = 0$,

$$\int \Pr(y_{ij}=1 \mid u_i)\, g(u_i)\, du_i \neq \frac{\exp(\boldsymbol{\beta}'\mathbf{x}_{ij})}{1+\exp(\boldsymbol{\beta}'\mathbf{x}_{ij})}, \tag{9.5}$$

where $g(u_i)$ is the normal probability density function of $u_i$. For this reason the coefficients $\boldsymbol{\beta}$, representing the conditional or subject-specific effects of covariates, for a given value of the random effect, cannot be interpreted as population averaged or marginal effects. The marginal effects tend to be closer to zero or "attenuated". (Note that here the term "marginal effects" means population averaged effects, a very different notion than "marginal effects" in econometrics as computed by the Stata command mfx.)
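The difference between the subject-specific probability at $u_i = 0$ and the population averaged probability can be illustrated numerically. The sketch below assumes a hypothetical value of 1 for the fixed part of the linear predictor and a random intercept standard deviation of 1.63, and it approximates the integral over the normal density by a simple Riemann sum. This is a crude stand-in for illustration only, not the quadrature that gllamm or gllapred use.

```stata
* Hypothetical fixed part of the linear predictor and random intercept SD
scalar eta = 1
scalar sd  = 1.63

* Conditional (subject-specific) probability for a subject with u = 0
scalar p_cond = invlogit(eta)

* Marginal probability: average invlogit(eta + u) over u ~ N(0, sd^2),
* using a Riemann sum over z = u/sd from -8 to 8 in steps of 0.01
scalar p_marg = 0
forvalues k = -800/800 {
    local z = `k'/100
    scalar p_marg = p_marg + invlogit(eta + sd*(`z'))*normalden(`z')*0.01
}

display "conditional probability at u = 0: " %6.4f p_cond
display "marginal probability:             " %6.4f p_marg
```

For a positive linear predictor the marginal probability lies between 0.5 and the conditional probability at $u_i = 0$, which is the attenuation described in the text.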

In gllapred we can use the mu and marg options to obtain the marginal probabilities on the left-hand side of (9.5) by numerical integration, and the mu and us() options to obtain the conditional probabilities in (9.4) for given values of $u_i$. To obtain smooth curves, first create a new dataset where month increases gradually from 0 to 10 and early equals 1:

replace month = 10*(_n-1)/(_N-1)
replace early = 1
replace month_early = month

Now we can obtain marginal probabilities for the random intercept model by first restoring the estimates and then using gllapred:

estimates restore mod1
gllapred probm1, mu marg

To calculate conditional probabilities, we must first define variables equal to the values at which we wish to evaluate $u_i$ (0 and ±1.7, approximately one standard deviation). The variable names must end on "1" since the random intercept is the first (and here the only) random effect:

generate m1 = 0
generate l1 = -1.7
generate u1 = 1.7
gllapred probc1_m, mu us(m)
gllapred probc1_l, mu us(l)
gllapred probc1_u, mu us(u)
drop m1 l1 u1

Here the us(m) option specifies that, if there were more than one random effect, the values would be in m1, m2, etc. Before producing the graph, we will make predictions for the random coefficient model:

estimates restore mod2
gllapred probm2, mu marg
generate m1 = 0
generate m2 = 0
generate l1 = -2.7
generate l2 = -0.4
generate u1 = 2.7
generate u2 = 0.4
generate ul1 = 2.7
generate ul2 = -0.4
generate lu1 = -2.7
generate lu2 = 0.4
gllapred probc2_m, mu us(m)
gllapred probc2_l, mu us(l)
gllapred probc2_u, mu us(u)
gllapred probc2_ul, mu us(ul)
gllapred probc2_lu, mu us(lu)

We have produced five conditional predictions, one with both random effects equal to 0 and four for all combinations of high and low values of the random intercept and slope. The graphs are obtained using

label variable month ///
    "Number of months since hospitalization"

twoway (line probm1 month) ///
    (line probc1_m month, lpatt(shortdash)) ///
    (line probc1_l month, lpatt(dash)) ///
    (line probc1_u month, lpatt(dash)), ///
    legend(order(1 "Marginal" 2 "Fixed part" ///
    3 "Conditional")) ytitle("Predicted probability")

and similarly for the random coefficient model; see Figures 9.8 and 9.9. In Figure 9.8 it is clear that the marginal or population averaged curve is flatter than the conditional or subject-specific curves. The dotted curve for $u_i = 0$ represents the curve of an average individual since 0 is the mean of the random effects distribution. Note that this curve of an average or typical individual differs from the population averaged curve! In Figure 9.9, we can see how different the trajectories for different patients can be in a random coefficient model. Again, the curve of the average individual differs from the averaged curve. Although the conditional predictions for the two models are quite different, the marginal predictions are nearly the same, as can be seen by plotting the two marginal curves on the same graph:

Figure 9.9: Conditional and marginal predicted probabilities for random coefficient model. The dotted curve is the conditional probability when $u_{0i} = 0$ and $u_{1i} = 0$.

twoway (line probm1 month) ///
    (line probm2 month, lpatt(dash)), ///
    legend(order(1 "Random intercept" ///
    2 "Random int. & slope")) ylabel(0(.2)1) ///
    ytitle("Marginal probability of thought disorder")

(see Figure 9.10). Here the ylabel() option was used to extend the y-axis range to 1 and produce appropriate axis labels to make this graph comparable with Figures 9.8 and 9.9.

Figure 9.10: Marginal predicted probabilities for random intercept and random coefficient models.

Note that gllamm can be used for a wide range of models with random effects and other latent variables such as factors, including (multilevel) structural equation models and latent class models with many different response types as well as mixed responses (see Skrondal and Rabe-Hesketh, 2003; 2004; Rabe-Hesketh et al., 2003; 2004a; 2004b). The webpage http://www.gllamm.org gives more references and up-to-date information.

9.5 Exercises
9.1 Thought disorder and schizophrenia

1. For the thought disorder data, produce graphs of the predicted probabilities for the individual patients separately for early and late onset (similar to Figures 9.2 and 9.6 for the post-natal depression data). Hint: use gllapred with the mu option (not marg or us()) to obtain posterior mean probabilities.

9.2 Australian school children

1. Analyze the Australian school children data described in Chapter 7 using a Poisson model with a random intercept for each child and compare the estimates with those of the negative binomial model estimated in Section 7.3.4, where the exponentiated random intercept (frailty) has a gamma distribution instead of a log-normal distribution.

9.3 Jaw growth

1. For the jaw growth data described in Exercise 8.3, estimate the following random intercept model

$$y_{ij} = \beta_0 + \beta_1 x_i + \beta_2 t_j + u_i + \epsilon_{ij},$$

where $x_i$ is a dummy variable for being male and $t_j$ is age − 8.

2. Interpret all the parameter estimates.

3. Extend the model to allow boys and girls to differ in their mean rate of growth and interpret the regression coefficients.

4. Extend the model further by including a random coefficient of $t_j$ and use a likelihood ratio test to choose between this model and the previous model.

5. For the chosen model, plot the predicted growth trajectories of the children (based on parameter estimates and empirical Bayes predictions of the random effects) by gender.

9.4 Wage increases

Here we consider the data used in Exercises 1.2 and 8.2, and will make use of the following additional variables:

• nr: person identifier
• educ: years of schooling
• exper: labor market experience (Age − 6 − educ)
• expersq: labor market experience squared
• married: dummy variable for being married
• union: dummy variable for being a member of a union (i.e., wage being set in collective bargaining agreement)

1. Estimate a model for lwage with black, hisp, educ, exper, expersq, married, and union as explanatory variables and with a random intercept for subjects.

2. Interpret the estimates.

3. An alternative approach to panel data that is popular in econometrics is to specify fixed intercepts for subjects instead of random ones; this can be accomplished by using xtreg with the fe option (which is equivalent to, but much more efficient than, using dummy variables for subjects). Fit the fixed effects version of the model above.

4. Explain why some variables are dropped by Stata and compare the regression coefficients of the remaining variables with those estimated for the random intercept model.

9.5 Epileptic seizures and chemotherapy

1. Analyze the epileptic seizure data introduced in the next chapter using a Poisson model with the same fixed part as specified in Section 10.3.2 and with a random intercept for subjects. Use gllamm with the adapt option. Make sure you are using enough quadrature points by comparing estimates with different numbers of quadrature points (nip() option).

2. Add a random slope for past and use a likelihood ratio test to decide whether or not to retain this model.

3. For the chosen model, plot the model-implied marginal relationship between the expected epilepsy rate and visit for 25 year olds in the two treatment groups. (Hint: use gllapred with the options mu, marg, and nooffset and plot the predictions for subjects 3 and 46.)

Chapter 10

Generalized Estimating Equations: Epileptic Seizures and Chemotherapy

10.1 Description of data
In a clinical trial reported by Thall and Vail (1990), 59 patients with epilepsy were randomized to groups receiving either the anti-epileptic drug progabide or a placebo in addition to standard chemotherapy. The number of seizures was counted over four two-week periods. In addition, a baseline seizure rate was recorded for each patient, based on the eight-week prerandomization seizure count. The age of each patient was also recorded. The main question of interest is whether the treatment progabide reduces the frequency of epileptic seizures compared with placebo. The data are shown in Table 10.1. (These data also appear in Hand et al., 1994.)

Table 10.1 Data in epil.dta
subj  id   y1  y2  y3  y4  treat  base  age
1     104  5   3   3   3   0      11    31

Table 10.1 Data in epil.dta (continued)
subj  id   y1  y2  y3  y4  treat  base  age
52    221  3   5   4   3   1      18    32

10.2 Generalized estimating equations
In this chapter we consider an approach to the analysis of longitudinal data that is very different from random effects modeling described in the previous chapter. Instead of attempting to model the dependence between responses on the same individuals as arising from between-subject heterogeneity represented by random intercepts and possibly random slopes, we will concentrate on estimating the marginal mean structure, treating the dependence as a nuisance.

10.2.1 Normally distributed responses

If we suppose that a normally distributed response is observed on each individual at $T$ time points, then the basic regression model for longitudinal data becomes (cf. equation (3.3))

$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \boldsymbol{\epsilon}_i, \tag{10.1}$$

where $\mathbf{y}_i' = (y_{i1}, y_{i2}, \ldots, y_{iT})$, $\boldsymbol{\epsilon}_i' = (\epsilon_{i1}, \epsilon_{i2}, \ldots, \epsilon_{iT})$, $\mathbf{X}_i$ is a $T \times (p+1)$ design matrix, and $\boldsymbol{\beta}' = (\beta_0, \ldots, \beta_p)$ is a vector of regression parameters. The residual terms are assumed to have a multivariate normal distribution with a covariance matrix of some particular form that is a function of (hopefully) a small number of parameters. Maximum likelihood estimation can be used to estimate both the parameters in (10.1) and the parameters structuring the covariance matrix (details are given in Jennrich and Schluchter, 1986). The latter are often not of primary interest (they are often referred to as nuisance parameters), but using a covariance matrix that fails to match that of the repeated measurements can lead to inefficient estimates and invalid standard errors for the parameters that are of concern, namely the $\boldsymbol{\beta}$ in (10.1).

If each non-replicated element of the covariance matrix is treated as a separate parameter, giving an unstructured covariance matrix, and if there are no missing data, then this approach is essentially equivalent to multivariate analysis of variance for longitudinal data (see Everitt, 2001). However, it is often more efficient to impose some meaningful structure on the covariance matrix. The simplest (and most unrealistic) structure is independence with all off-diagonal elements (the covariances) equal to zero, and typically all diagonal elements (the variances) equal to each other. Another commonly used simple structure, known as compound symmetry (for example, see Winer, 1971), requires that all covariances are equal and all variances are equal. This is just the correlation structure of a linear random intercept model described in the previous chapter, except that the random intercept model also requires that the correlation be positive.

Other correlation structures include autoregressive structures where the correlations decrease with the distance between time points. Whatever the assumed correlation structure, all models may be estimated by maximum likelihood.
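To make the compound symmetry and autoregressive structures concrete, the following sketch builds both correlation matrices for four time points using Stata's matrix commands. The common correlation of 0.5 and the choice of a first-order autoregressive form (correlation $\rho^{|j-k|}$ between time points $j$ and $k$) are assumptions made for illustration only.

```stata
* Illustrative correlation and number of time points (assumed values)
local rho 0.5
local T   4

* Compound symmetry: 1 on the diagonal, rho everywhere else
matrix CS = (1 - `rho')*I(`T') + `rho'*J(`T', `T', 1)

* First-order autoregressive: correlation rho^|j-k| decays with the time lag
matrix AR = J(`T', `T', 0)
forvalues j = 1/`T' {
    forvalues k = 1/`T' {
        matrix AR[`j', `k'] = `rho'^abs(`j' - `k')
    }
}

matrix list CS
matrix list AR
```

Under compound symmetry every pair of occasions has the same correlation, whereas in the autoregressive matrix the correlation between the first and last occasions has already dropped to 0.125.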

Unfortunately, it is generally not straightforward to specify a multivariate model for non-normal responses. One solution, discussed in the previous chapter, is to induce residual dependence among the responses using random effects. An alternative approach is to give up the idea of a model altogether by using generalized estimating equations (GEE) as introduced by Liang and Zeger (1986). Generalized estimating equations are essentially a multivariate extension of the quasi-likelihood approach discussed in Chapter 7 (see also Wedderburn, 1974). In GEE the parameters are estimated using "estimating equations" resembling the score equations for maximum likelihood estimation of the linear model described in the previous section. These estimating equations only require specification of a link and variance function and a correlation structure for the observed responses conditional on the covariates. As in the quasi-likelihood approach, the parameters can be estimated even if the specification does not correspond to any statistical model.

The regression coefficients represent marginal effects, i.e., they determine the population averaged relationships. Liang and Zeger (1986) show that the estimates of these coefficients are valid even when the correlation structure is incorrectly specified. Correct inferences can be obtained using robust estimates of the standard errors based on the sandwich estimator for clustered data (e.g., Binder, 1983; Williams, 2000). The parameters of the correlation matrix, referred to as the working correlation matrix, are treated as nuisance parameters. However, Lindsey and Lambert (1998) and Crouchley and Davies (1999) point out that estimates are no longer consistent if "endogenous" covariates such as baseline responses are included in the model. Fortunately, inclusion of the baseline response as a covariate does yield consistent estimates of treatment effects in clinical trial data such as the epilepsy data considered here (see Crouchley and Davies, 1999) as long as the model does not contain a baseline by treatment interaction.

There are some important differences between GEE and random effects modeling. First, while random effects modeling is based on a statistical model and typically maximum likelihood estimation, GEE is just an estimation method that is not based on a statistical model. Second, there is an important difference in the interpretation of the regression coefficients. In random effects models, the regression coefficients represent conditional or subject-specific effects for given values of the random effects. For GEE, on the other hand, the regression coefficients represent marginal or population averaged effects. As we saw in the thought disorder data in the previous chapter, conditional and marginal relationships can be very different. Either may be of interest; for instance, patients are likely to want to know the subject-specific effect of treatments, whereas health economists may be interested in population averaged effects. Whereas random effects models allow the marginal relationship to be derived, GEE does not allow derivation of the conditional relationship. Note that conditional and marginal relationships are the same if an identity link is used and, in the case of random intercept models (no random coefficients), if a log link is specified (see Diggle et al., 2002). Third, GEE is often preferred because, in contrast to the random effects approach, the parameter estimates are consistent even if the correlation structure is misspecified (although this is true only if the mean structure is correctly specified). Fourth, while maximum likelihood estimation of a correctly specified model is consistent if data are missing at random (MAR), this is not the case for GEE, which requires that responses are missing completely at random (MCAR), or that missingness depends only on the covariates included in the model. See Hardin and Hilbe (2002) for a thorough introduction to GEE.

10.3 Analysis using Stata

The generalized estimating equations approach, as described in Liang and Zeger (1986), is implemented in Stata's xtgee command. The main components which have to be specified are:

• the assumed distribution of the response variable (given the covariates), specified in the family() option; this determines the variance function,

• the link between the response variable and its linear predictor, specified in the link() option, and

• the structure of the working correlation matrix, specified in the correlation() option.

In general, it is not necessary to specify the link() option since, as for the glm command, the default link is the canonical link for the specified family.

Since the xtgee command will often be used with the family(gauss) option, together with the identity link function, we will illustrate this option on the post-natal depression data used in the previous two chapters before moving on to deal with the epilepsy data in Table 10.1.

10.3.1 Post-natal depression data

The data are obtained using

    infile subj group dep0 dep1 dep2 dep3 dep4 dep5 dep6 ///
        using depress.dat, clear
    reshape long dep, i(subj) j(visit)
    mvdecode _all, mv(-9)

To begin, we fit a model that regresses dep on group, visit, their interaction and visit squared as in the previous chapter but under the unrealistic assumption of independence. The necessary command written out in its fullest form is

    generate gr_vis = group*visit
    generate vis2 = visit^2
    xtgee dep group visit gr_vis vis2, i(subj) t(visit) ///
        corr(indep) link(iden) family(gauss)

(see Display 10.1). Here, the fitted model is simply a multiple regression model for 356 observations which are assumed to be independent of one another; the estimated scale parameter is just the residual mean square, and the deviance is equal to the residual sum of squares. The estimated regression coefficients and their associated standard errors indicate that the group by visit interaction is significant at the 5% level. However, treating the observations as independent is unrealistic and will almost certainly lead to poor estimates of the standard errors. Standard errors for between-subject factors (here group) are likely to be underestimated because we are treating observations from the same subject as independent, thus increasing the apparent sample size; standard errors for within-subject factors (here visit, gr_vis, and vis2) are likely to be overestimated since we are not controlling for residual between-subject variability.

    GEE population-averaged model              Number of obs      =       356
    Group variable:                  subj      Number of groups   =        61
    Link:                        identity      Obs per group: min =         2
    Family:                      Gaussian                     avg =       5.8
    Correlation:              independent                     max =         7
                                               Wald chi2(4)       =    269.51
    Scale parameter:             26.89935      Prob > chi2        =    0.0000

    Pearson chi2(356):            9576.17      Deviance           =   9576.17
    Dispersion (Pearson):        26.89935      Dispersion         =  26.89935

                 .390a383   ,579783   4.89   0.000   .2340665   .5468102

    Display 10.1

We therefore now abandon the assumption of independence and estimate a correlation matrix having compound symmetry (i.e., constraining the correlations between the observations at any pair of time points to be equal). Such a correlation structure is specified using corr(exchangeable), or the abbreviated form corr(exc). The model can be fitted as follows:

    xtgee dep group visit gr_vis vis2, i(subj) t(visit) ///
        corr(exc) link(iden) fam(gauss)

Instead of specifying the subject and time identifiers using the options i() and t(), we can also declare the data as being of the form xt (for cross-sectional time series) as follows:

    iis subj
    tis visit

and omit the i() and t() options from now on. Since both the link and the family correspond to the default options, the same analysis may be carried out using the shorter command

    xtgee dep group visit gr_vis vis2, corr(exc)

(see Display 10.2). After estimation, estat wcorrelation reports the estimated working correlation matrix

    estat wcorrelation, format(%6.4g)

which is shown in Display 10.3. Here the format() option was used to reduce the number of decimal places and therefore avoid rows of the matrix wrapping over two lines.

Note that the standard error for group has increased whereas those for visit, gr_vis, and vis2 have decreased as expected. The estimated within-subject correlation matrix is compound symmetric. This structure is frequently not acceptable since correlations between pairs of observations widely separated in time will often be lower than for observations closer together. This pattern was apparent from the scatterplot matrix given in Chapter 8.

To allow for such a pattern of correlations among the repeated observations, we can move to an autoregressive structure. For example, in a first-order autoregressive specification the correlation between time points $r$ and $s$ is assumed to be $\rho^{|r-s|}$. The necessary instruction for fitting the previously considered model but with this first-order autoregressive structure for the correlations is

    xtgee dep group visit gr_vis vis2, corr(ar1)

    GEE population-averaged model              Number of obs      =       356
    Group and time vars:       subj visit      Number of groups   =        61
    Link:                        identity      Obs per group: min =         2
    Family:                      Gaussian                     avg =       5.8
    Correlation:                    AR(1)                     max =         7
                                               Wald chi2(4)       =    213.85
    Scale parameter:              27.10X8      Prob > chi2        =    0.0000

             dep |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

    Display 10.4
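To see what the AR(1) assumption implies, the correlations $\rho^{|r-s|}$ can be tabulated for each lag. The following is a minimal sketch only: the value 0.6 for $\rho$ is an arbitrary assumption chosen for illustration, not an estimate from the depression data, and lags 0 to 6 correspond to the seven visits.

```stata
* Illustration only: rho = 0.6 is an assumed value, not an estimate
local rho = 0.6
forvalues lag = 0/6 {
    display "lag `lag': " %6.4f `rho'^`lag'
}
```

The printed correlations fall geometrically with the lag, in contrast to the single common correlation of the exchangeable structure.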

The estimates of the regression coefficients and their standard errors in Display 10.4 have changed but not substantially. The estimated within-subject correlation matrix may again be obtained using

    estat wcorrelation, format(%6.4g)

(see Display 10.5) which has the expected pattern in which correlations decrease substantially as the separation between the observations increases.

    Estimated within-subj correlation matrix R:

    Display 10.5

Other correlation structures are available for xtgee, including the option correlation(unstructured) in which no constraints are placed on the correlations. (This is essentially equivalent to multivariate analysis of variance for longitudinal data, except that the variance is assumed to be constant over time.) It might appear that using this option would be the most sensible one to choose for all data sets. This is not, however, the case since it necessitates the estimation of many nuisance parameters. This can, in some circumstances, cause problems in the estimation of those parameters of most interest, particularly when the sample size is small and the number of time points is large.
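The number of nuisance parameters grows quickly with the number of time points: with T time points an unstructured matrix has T(T-1)/2 distinct correlations. A hedged sketch for the seven visits of the depression data, assuming the data and the iis/tis declarations above are still in place (estimation may be slow, or may fail to converge, with so many parameters):

```stata
* 7 time points give 7*6/2 = 21 distinct correlations
display 7*6/2
xtgee dep group visit gr_vis vis2, corr(unstructured)
estat wcorrelation, format(%6.4g)
```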

We now analyze the epilepsy data using a similar model as for the depression data, but using the Poisson distribution and log link. The data are available in a Stata file epil.dta and can be read using

    use epil, clear

We will treat the baseline measure as one of the responses:

    generate y0 = baseline

Some useful summary statistics can be obtained using

    summarize y0 y1 y2 y3 y4 if treat==1

(see Displays 10.6 and 10.7). We see that the number of observations is constant over time; there appears to be no dropout. The means and standard deviations of y0 are larger than for the other responses because seizures were counted over an 8-week period at baseline and over 2-week periods at the subsequent visits. The largest value of y1 in the progabide group seems out of step with the other maximum values and may indicate an outlier. Some graphics of the data may be useful for investigating this possibility further, but first it is convenient to reshape the data from its present "wide" form to the "long" form. We now reshape the data as follows:

    reshape long y, i(subj) j(visit)
    sort subj treat visit
    list in 1/12, clean

        Variable |   Obs        Mean    Std. Dev.    Min    Max
              y1 |    31    8.580645    18.24067            102
              y2 |    31    8.419355    11.85986             65
              y3 |    31    8.129032    13.89421             72
              y4 |    31    6.709677    11.26408             63

    Display 10.7

         subj   visit    id     y   treat   baseline   age
            1       0   104    11       0         11    31
            1       1   104     5       0         11    31
            1       2   104     3       0         11    31
            1       3   104     3       0         11    31
            1       4   104     3       0         11    31
            2       0   106    11       0         11    30
            2       1   106     3       0         11    30
            2       2   106     5       0         11    30
            2       3   106     3       0         11    30
            2       4   106     3       0         11    30
            3       0   107     6       0          6    25
            3       1   107     2       0          6    25

    Display 10.8

Perhaps the most useful graphical display for investigating the data is a set of graphs of individual response profiles. Since we are planning to fit a Poisson model with the log link to the data, we take the log transformation before plotting the response profiles. (We need to add a positive number, say 1, because some seizure counts are zero.)

    generate ly = log(y+1)

However, the baseline measure represents seizure counts over an 8-week period, compared with 2-week periods for each of the other time points. We therefore divide the baseline count by 4:

    replace ly = log(y/4+1) if visit==0

and then plot the log-counts:

    twoway connect ly visit if treat==0, by(subj, ///
        style(compact)) ytitle("Log count")

    twoway connect ly visit if treat==1, by(subj, ///
        style(compact)) ytitle("Log count")

The resulting graphs are shown in Figures 10.1 and 10.2. There is no obvious improvement in the progabide group. Subject 49 had more epileptic fits overall than any other subject and might perhaps be considered an outlier (see Exercise 10.2).

Figure 10.2: Response profiles in the treated group.

As discussed in Chapter 7, the most plausible distribution for count data is often the Poisson distribution. The Poisson distribution is specified in xtgee models using the option family(poisson). The log link is implied (since it is the canonical link). The baseline counts were obtained over an 8-week period whereas all subsequent counts are over 2-week periods. To model the seizure rate in counts per week, we must therefore use the log observation period $\log(p_i)$ as an offset (a covariate with regression coefficient set to 1). The model for the mean count $\mu_{ij}$ becomes

$$\log(\mu_{ij}) = \boldsymbol{\beta}^T \mathbf{x}_{ij} + \log(p_i),$$

so that the rate is modeled as

$$\mu_{ij}/p_i = \exp(\boldsymbol{\beta}^T \mathbf{x}_{ij}).$$

We can compute the required offset using

    generate lnobs = cond(visit==0,ln(8),ln(2))

Following Diggle et al. (2002), we will allow the log rate to change by a treatment group-specific constant after the baseline assessment. The necessary covariates, an indicator for the post-baseline visits and an interaction between that indicator and treatment group, are created using

    generate post = visit>0
    generate tr_post = treat*post

We will also control for the age of the patients. The summary tables for the seizure data given on page 210 provide strong empirical evidence that there is overdispersion (the variances are greater than the means), and this can be incorporated using the scale(x2) option to allow the dispersion parameter $\phi$ to be estimated (see also Chapter 7).
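A quick way to look for overdispersion directly, sketched here under the assumption that the long-form epilepsy data are in memory, is to compare the mean and variance of the counts at each visit. For Poisson data the two would be roughly equal.

```stata
* Mean and variance of the seizure counts by visit (long-form data assumed)
tabstat y, by(visit) statistics(mean variance)
```

Variances far above the means point to overdispersion; this crude check pools both treatment groups and ignores the covariates. The overdispersed model itself is fitted with the commands that follow.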

    iis subj
    xtgee y age treat post tr_post, corr(exc) family(pois) ///
        offset(lnobs) scale(x2)

    GEE population-averaged model              Number of obs      =       295
    Group variable:                  subj      Number of groups   =        59
    Link:                             log      Obs per group: min =         5
    Family:                       Poisson                     avg =       5.0
    Correlation:             exchangeable                     max =         5
                                               Wald chi2(4)       =      5.43
    Scale parameter:             18.48008      Prob > chi2        =    0.2458

               y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
             age |  -.0322513   .0148644   -2.17   0.030     -.061385   -.0031176
           treat |  -.0177873    .201945   -0.09   0.930    -.4135922    .3780176
            post |   .1107981   .1500635    0.74   0.460     -.183321    .4049173
         tr_post |  -.1036807    .213317   -0.49   0.627    -.5217742     .314413
           _cons |   2.265255   .4400816    5.15   0.000     1.402711      3.1278
           lnobs |   (offset)

    (Standard errors scaled using square root of Pearson X2-based dispersion)

    Display 10.9
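The offset can also be supplied on the original scale. As a sketch, and assuming a new variable name nobs that is not used in the text, the exposure() option of xtgee takes the length of the observation period itself and includes its logarithm in the model with coefficient 1, which should reproduce the fit in Display 10.9:

```stata
* nobs is a hypothetical name: observation period in weeks
generate nobs = cond(visit==0, 8, 2)
xtgee y age treat post tr_post, corr(exc) family(pois) ///
    exposure(nobs) scale(x2)
```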

The output assuming an exchangeable correlation structure is given in Display 10.9, and the estimated correlation matrix is obtained using xtcorr (see Display 10.10).

    Estimated within-subj correlation matrix R:

    Display 10.10

In Display 10.9, the parameter $\phi$ is estimated as 18.5, indicating severe overdispersion in these data. We briefly illustrate how important it was to allow for overdispersion by omitting the scale(x2) option:

    xtgee y age treat post tr_post, corr(exc) family(pois) ///
        offset(lnobs)

    GEE population-averaged model              Number of obs      =       295
    Group variable:                  subj      Number of groups   =        59
    Link:                             log      Obs per group: min =         5
    Family:                       Poisson                     avg =       5.0
    Correlation:             exchangeable                     max =         5
                                               Wald chi2(4)       =    100.38
    Scale parameter:                    1      Prob > chi2        =    0.0000

    Display 10.11

The results given in Display 10.11 show that the standard errors are now much smaller than before. Even if overdispersion had not been suspected, this error could have been detected by using the vce(robust) option (see Chapter 7):

    xtgee y age treat post tr_post, corr(exc) family(pois) ///
        offset(lnobs) vce(robust)

The results of the robust regression in Display 10.12 are remarkably similar to those of the overdispersed Poisson model, suggesting that the latter is a reasonable "model" for the data.

    GEE population-averaged model              Number of obs      =       295
    Group variable:                  subj      Number of groups   =        59
    Link:                             log      Obs per group: min =         5
    Family:                       Poisson                     avg =       5.0
    Correlation:             exchangeable                     max =         5
                                               Wald chi2(4)       =      6.85
    Scale parameter:                    1      Prob > chi2        =    0.1442

                                  (Std. Err. adjusted for clustering on subj)
                 |             Semi-robust
               y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]

    Display 10.12

The estimated coefficient of tr_post represents the estimated difference in the change in log seizure rate from baseline to post randomization between the placebo and progabide groups. In the placebo group there is an increase in the log seizure rate of 0.1108, and in the progabide group there is an increase of only 0.007 (= 0.1108 - 0.1037). However, the difference is not significant (p = 0.68). The exponential of the interaction coefficient gives an estimated incidence rate ratio, here the ratio of the relative increase in seizure rate for the treated patients compared with the control patients. The exponentiated coefficient and the corresponding confidence interval can be obtained directly using the eform option in xtgee:

    xtgee y age treat post tr_post, corr(exc) ///
        family(pois) offset(lnobs) scale(x2) eform

The results in Display 10.13 indicate that the relative increase in seizure rate is 10% lower in the treated group compared with the control group, with a 95% confidence interval from 41% lower to 37% higher.
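The change in the progabide group quoted above is a linear combination of two coefficients, so it can be obtained together with a standard error from lincom. A minimal sketch, assuming the xtgee model with scale(x2) has just been fitted:

```stata
* Change in log seizure rate from baseline in the progabide group
lincom post + tr_post
* The same change expressed as a rate ratio
lincom post + tr_post, eform
* Rate ratio for the interaction, from the tr_post coefficient in Display 10.9
display exp(-.1036807)
```

The last line should print approximately .9015, in line with the 10% reduction described above.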

    GEE population-averaged model              Number of obs      =       295
    Group variable:                  subj      Number of groups   =        59
    Link:                             log      Obs per group: min =         5
    Family:                       Poisson                     avg =       5.0
    Correlation:             exchangeable                     max =         5
                                               Wald chi2(4)       =      5.43
    Scale parameter:             18.48008      Prob > chi2        =    0.2458

               y |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
             age |   .9682632   .0143927   -2.17   0.030     .9404613    .9968873
           treat |     .98237   .1983847   -0.09   0.930     .6612706    1.459389
            post |   1.117168   .1676464    0.74   0.460     .8325009    1.499179
         tr_post |   .9015131    .192308   -0.49   0.627     .5934667    1.369456
           lnobs |   (offset)

    (Standard errors scaled using square root of Pearson X2-based dispersion)

    Display 10.13

However, before interpreting these estimates, we should perform some diagnostics. Standardized Pearson residuals can be useful for identifying potential outliers (see equation (7.9)). These can be found by first using the predict command to obtain predicted counts, subtracting these from the observed counts, and dividing by the estimated standard deviation $\sqrt{\hat{\phi}\hat{\mu}_{ij}}$, where $\hat{\phi}$ is the estimated dispersion parameter:

    quietly xtgee y treat baseline age visit, corr(exc) ///
        family(pois) scale(x2)
    predict pred, mu
    generate stpres = (y-pred)/sqrt(e(chi2_dis)*pred)

Boxplots of these residuals at each visit are obtained using

    sort visit
    graph box stpres, medtype(line) over(visit, ///
        relabel(1 "visit 1" 2 "visit 2" 3 "visit 3" ///
        4 "visit 4"))

The resulting graph is shown in Figure 10.3. Pearson residuals greater than 4 are certainly a cause for concern, so we can check which subjects they belong to using

    list subj id if stpres>4

Subject 49 appears to be an outlier due to extremely large counts as we saw in Figure 10.2. Subject 25 also has an unusually large count at visit 3. It would be a good idea to repeat the analysis without subject 49 to see how much the results are affected by this unusual subject (see Exercise 10.2). This can be viewed as a sensitivity analysis.

Figure 10.3: Standardized Pearson residuals.

10.4 Exercises

10.1 Treatment of post-natal depression

1. For the depression data, compare the results of GEE with a compound symmetric structure with ordinary linear regression where standard errors are corrected for the within-subject correlation using:
   a. the options, vce(robust) cluster(subj), to obtain the sandwich estimator for clustered data (see help for regress), and
   b. bootstrapping, by sampling subjects with replacement. This may be achieved using the bootstrap prefix, together with the option cluster(subj).

10.2 Epileptic seizures and chemotherapy

1. Explore other possible correlation structures for the seizure data in the context of a Poisson model. Examine the robust standard errors in each case.

2. Repeat the above analyses, but excluding subject 49 (who appears to be an outlier). Compare the results.

10.3 Thought disorder and schizophrenia

1. For the thought disorder data discussed in the previous chapter, estimate the effect of early, month and their interaction on the logit of thought disorder using GEE with an exchangeable correlation structure. Use robust standard errors.

2. Interpret the estimates.

3. Plot the predicted probability over time for early onset women (using graph twoway function, see Section 6.3.2), and compare the curve with the curves in Figure 9.10.

10.4 Driver education

In a randomized experiment to investigate if driver education reduces the number of collisions and traffic violations of teenagers (Stock et al., 1983), eligible high school students were randomized to three groups: safe performance curriculum (SPC), pre-driver license curriculum (PDL), and control. Whereas the SPC was a 70-hour state-of-the-art program, the PDL was a 30-hour course containing only the minimum training required to pass the driving test. The control group received no training through the school system and was taught by the parents and/or private training schools only. During three years of follow-up, the occurrence of collisions and moving violations was obtained using records from the state Department of Motor Vehicles. (The data are from Davis, 2002.)

The variables in drivers.dta are:

• program: group (string variable with values SPC, PDL, and Control)

• gender: gender (string variable with values Male and Female)

• col1 to col3: indicator for at least one collision or moving violation during years 1 to 3

• num: number of times the response-covariate pattern occurred

1. Prepare the data for analysis using GEE. (Hint: make sure to expand the data first using expand num, then reshape to long.)

2. Investigate the effect of time, program, gender, and the program by gender interaction on the odds of at least one collision or moving violation using generalized estimating equations with a logit link and unstructured correlations. Use robust standard errors throughout this exercise.

3. Perform a Wald test for the interaction terms and remove them if the test is not significant at the 5% level.

4. By inspecting the estimated correlation matrix, choose the correlation structure that appears to be most appropriate and estimate the model with that correlation structure.

5. Interpret the odds ratio estimates for the final model.

Chapter 11

Some Epidemiology

11.1 Description of data

This chapter illustrates analysis of different epidemiological designs, namely cohort studies and matched as well as unmatched case-control studies. Four datasets will be used which are presented in the form of cross-tabulations in Tables 11.1 to 11.4. (Tables 11.1 and 11.4 are taken from Clayton and Hills (1993) with permission of their publisher, Oxford University Press.)

The data in Table 11.1 result from a cohort study which investigated the relationship between diet and ischemic heart disease (IHD). Here we consider low energy intake as a risk factor since it is highly associated with lack of physical exercise. The table gives frequencies of IHD by ten-year age-band and exposure to a high or low calorie diet. The total person-years of observation are also given for each cell.

The dataset in Table 11.2 is the result of a case-control study investigating whether keeping a pet bird is a risk factor for lung cancer. This dataset is given in Hand et al. (1994).

The datasets in Tables 11.3 and 11.4 are from matched case-control studies, the first with a single matched control and the second with three matched controls. Table 11.3 arises from a matched case-control study of endometrial cancer where cases were matched on age, race, date of admission, and hospital of admission to a suitable control not suffering from cancer. Past exposure to conjugated estrogens was determined. The dataset is described in Everitt (1994). Finally, the data in Table 11.4, described in Clayton and Hills (1993), arise from a case-control study of breast cancer screening. Women who had died of breast cancer were matched with three control women. The screening history of each control was assessed over the period up to the time of diagnosis of the matched case.

11.2 Introduction to epidemiology

Epidemiology can be described as the study of diseases in populations, in particular the search for causes of disease. For ethical reasons, subjects cannot be randomized to possible risk factors in order to establish whether these are associated with an increase in the incidence of disease, and therefore epidemiology is based on observational studies. The most important types of studies in epidemiology are cohort studies and case-control studies. We will give a very brief description of the design and analysis of these two types of studies, following closely the explanations and notation given in the excellent book, Statistical Models in Epidemiology, by Clayton and Hills (1993).

11.2.1 Cohort studies

In a cohort study, a group of subjects free of the disease is followed up and the presence of risk factors as well as the occurrence of the disease of interest are recorded. This design is illustrated in Figure 11.1. An example of a cohort study is the study described in the previous section where subjects were followed up to monitor the occurrence of ischemic heart disease in two risk groups, those with high and low energy intake, giving the results in Table 11.1.

Figure 11.1: Cohort study.

Table 11.1 Number of IHD cases and person-years of observation by age and exposure to low energy diet

                     Exposed            Unexposed
                     < 2750 kcal        ≥ 2750 kcal

Table 11.2 Number of lung cancer cases and controls who keep a pet bird

    Kept pet birds    Cases    Controls
    Yes                  98         101
    No                  141         328
    Total               239         429

Table 11.3 Frequency of exposure to oral conjugated estrogens among cases of endometrial cancer and their matched controls

Table 11.4 Screening history in subjects who died of breast cancer and 3 matched controls

                           Number of controls screened
    Status of the case       0      1      2      3
    Screened                 1      4      3      1
    Unscreened              11     10     12      4

The incidence rate of the disease $\lambda$ may be estimated by the number of new cases of the disease $D$ during a time interval divided by the person-time of observation $Y$, the sum of all subjects' periods of observation during the time interval:

$$\hat{\lambda} = \frac{D}{Y}.$$

This is the maximum likelihood estimator of $\lambda$ assuming that $D$ follows a Poisson distribution (independent events occurring at a constant probability rate in continuous time) with mean $\lambda Y$, where $Y$ is treated as fixed.

The most important quantity of interest in a cohort study is the incidence rate ratio (or relative risk), the ratio $\lambda_1/\lambda_0$ of incidence rates for those exposed to a risk factor and those not exposed to the risk factor (subscripts 1 and 0 denote exposed and unexposed, respectively). The incidence rate ratio may be estimated by

$$\frac{\hat{\lambda}_1}{\hat{\lambda}_0} = \frac{D_1/Y_1}{D_0/Y_0}.$$

This estimator can be derived by maximizing the conditional (binomial) likelihood that there were $D_1$ cases in the exposed group conditional on there being a total of $D = D_0 + D_1$ cases.
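For a single table, the rate ratio and its confidence interval can be obtained from cell counts with the immediate command iri. The counts below are invented purely for illustration and are not taken from Table 11.1: 30 cases in 2000 person-years among the exposed and 15 cases in 3000 person-years among the unexposed.

```stata
* Hypothetical counts, in the order:
* exposed cases, unexposed cases, exposed person-years, unexposed person-years
iri 30 15 2000 3000
* The same point estimate by hand
display (30/2000)/(15/3000)
```

Both give a rate ratio of 3 for these made-up numbers.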

However, a potential problem in estimating this rate ratio is confounding arising from systematic differences in prognostic factors between the exposure groups. This problem can be dealt with by dividing the cohort into groups or strata according to prognostic factors and assuming that the rate ratio for exposed and unexposed subjects is the same across strata. If there are $D^s$ cases and $Y^s$ person-years of observation in stratum $s$, then the common rate ratio may be estimated using the method of Mantel and Haenszel by

$$\frac{\sum_s D_1^s Y_0^s / Y^s}{\sum_s D_0^s Y_1^s / Y^s}.$$

Note that the strata might not correspond to groups of subjects. For example, if the confounder is age, subjects who cross from one age band into the next during the study contribute parts of their periods of observation to different strata. This is how Table 11.1 was constructed.

A more general way of controlling for confounding variables is to use Poisson regression to model the number of occurrences of disease or "failures". Such an approach allows inclusion of several covariates. The complexity of the model can be decided by model selection criteria, often leading to smoothing through the omission of higher order interactions. If a log link is used, the expected number of failures can be made proportional to the person-years of observation by adding the log of the person-years of observation to the linear predictor as an offset (an explanatory variable with regression coefficient set to 1), giving

$$\log[E(D)] = \log(Y) + \boldsymbol{\beta}^T \mathbf{x},$$

so that $E(D) = Y \exp(\boldsymbol{\beta}^T \mathbf{x})$, as required.

11.2.2 Case-control studies

If the incidence rate of a disease is small, a cohort study requires a large number of person-years of observation making it very expensive. A more feasible type of study in this situation is a case-control study in which cases of the disease of interest are compared with non-cases, often called controls, with respect to exposure to possible risk factors in the past. The basic idea of case-control studies is shown in Figure 11.2. The assumption here is that the probability of selection into the study is independent of the exposures of interest. The data in Table 11.2 derive from a case-control study in which cases with lung cancer and healthy controls were interviewed to ascertain whether they had been "exposed" to a pet bird.

Figure 11.2: Case-control study.

Let $D$ and $H$ be the number of cases and controls, respectively, and let the subscripts 0 and 1 denote "unexposed" and "exposed". Since the proportion of cases was determined by the design, it is not possible to estimate the relative risk of disease comparing exposed and nonexposed subjects. However, the odds of exposure in the cases or controls can be estimated, and the ratio of these odds is equal to the odds ratio of being a case in the exposed group compared with the unexposed group:

$$\frac{D_1/D_0}{H_1/H_0} = \frac{D_1/H_1}{D_0/H_0}.$$

We model the (log) odds of being a case using logistic regression with the exposure as an explanatory variable. Then the coefficient of the exposure variable is an estimate of the desired log odds ratio even though the estimate of the odds (which depends on the constant) is determined by the proportion of cases in the study. Logistic regression is the most popular method for estimating adjusted odds ratios for risk factors of interest after controlling for confounding variables.
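For the pet bird data in Table 11.2, the odds ratio can be computed directly from the four cell counts. A brief sketch using the immediate command cci, whose arguments are the numbers of exposed cases, unexposed cases, exposed controls, and unexposed controls:

```stata
* Table 11.2: cases 98 exposed, 141 unexposed; controls 101 exposed, 328 unexposed
cci 98 141 101 328
* Cross-product ratio by hand
display (98*328)/(141*101)
```

The cross-product ratio is about 2.26, so the odds of lung cancer are estimated to be roughly twice as high among those who kept a pet bird, before any adjustment for confounders.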

11.2.3 Matched case-control studies

A major difficulty with case-control studies is to find suitable controls who are similar enough to the cases (so that differences in exposure can reasonably be assumed to be due to their association with the disease) without being overmatched, which can result in very similar exposure patterns. The problem of finding controls who are sufficiently similar is often addressed by matching controls individually to cases according to important variables such as age and sex. Examples of such matched case-control studies are given in Tables 11.3 and 11.4. In the screening study, matching had the following additional advantage noted in Clayton and Hills (1993). The screening history of controls could be determined by considering only the period up to the diagnosis of the case, ensuring that cases did not have a decreased opportunity for screening because they would not have been screened after their diagnosis.

The statistical analysis has to take account of the matching. Two methods of analysis are McNemar's test in the simple case of 2 × 2 tables and conditional logistic regression in the case of several controls per case and/or several explanatory variables. Since the case-control sets have been matched on variables that are believed to be associated with disease status, the sets can be thought of as strata with subjects in one stratum having higher or lower odds of being a case than those in another stratum after controlling for the exposures. A logistic regression model would have to accommodate these differences by including a parameter $\alpha_c$ for each case-control set $c$, so that the log odds of being a case for subject $i$ in case-control set $c$ would be

$$\alpha_c + \boldsymbol{\beta}^T \mathbf{x}_i. \qquad (11.1)$$

However, this would result in too many parameters to be estimated (the incidental parameter problem). Furthermore, the parameters $\alpha_c$ are of no interest to us.

In conditiund logistic regrcwion, thc nuisance parameters a, are ~limninated a~ ffollom. Jn a 1:l matched case-control study, ignoring t l ~ c fact that cach wt has one case, the probability that subject I in tlc scl is a case and athject 2 is H LIOIICBSF: is

anrl the p r ~ b a b t l i t ~ tlmt ssllbject 1 is a noncse ard subject 2 is n case '5

Howevcr, conditio~lztl on tthrc being onc case in n set, the pmhability nf w~hject 1 being thc c ~ e is simply

since $\alpha_c$ cancels out; see equation (11.1). The expression on the right-hand side of equation (11.2) is the contribution of a single case-control set to the conditional likelihood of the sample. Similarly, it can be shown that if there are $k$ controls per case and the subjects within each case-control set are labeled 1 for the case and 2 to $k + 1$ for the controls then the log likelihood becomes

$$\sum_c \log\left( \frac{\exp(\mathbf{x}_{c1}'\boldsymbol{\beta})}{\sum_{j=1}^{k+1} \exp(\mathbf{x}_{cj}'\boldsymbol{\beta})} \right),$$

where $\mathbf{x}_{cj}$ denotes the explanatory variables of subject $j$ in case-control set $c$.
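The cancellation of the set-specific parameter in the 1:1 case can be written out step by step. The following is a sketch only, assuming the log odds for subject $i$ in set $c$ take the form $\alpha_c + \mathbf{x}_i'\boldsymbol{\beta}$; the symbols $p_{1\bar{2}}$ and $p_{\bar{1}2}$ are shorthand introduced here for the two joint probabilities.

```latex
% Assumed notation: log odds = \alpha_c + x_i'\beta for subject i in set c.
% p_{1\bar 2}: subject 1 is a case and subject 2 a noncase; p_{\bar 1 2}: the reverse.
\begin{align*}
p_{1\bar 2} &= \frac{e^{\alpha_c + x_1'\beta}}
                   {(1 + e^{\alpha_c + x_1'\beta})(1 + e^{\alpha_c + x_2'\beta})},
\qquad
p_{\bar 1 2} = \frac{e^{\alpha_c + x_2'\beta}}
                   {(1 + e^{\alpha_c + x_1'\beta})(1 + e^{\alpha_c + x_2'\beta})},\\[4pt]
\frac{p_{1\bar 2}}{p_{1\bar 2} + p_{\bar 1 2}}
  &= \frac{e^{\alpha_c} e^{x_1'\beta}}
          {e^{\alpha_c} e^{x_1'\beta} + e^{\alpha_c} e^{x_2'\beta}}
   = \frac{e^{x_1'\beta}}{e^{x_1'\beta} + e^{x_2'\beta}}.
\end{align*}
```

With a single binary exposure, pairs in which case and control have the same exposure contribute 1/2 whatever the value of $\beta$, so only the discordant pairs carry information. Maximizing the conditional likelihood then gives an estimated odds ratio equal to the number of pairs in which only the case is exposed divided by the number in which only the control is exposed.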

11.3 Analysis using Stata

11.3.1 Cohort study

There is a collection of instructions in Stata, the epitab commands, that may be used to analyze small tables in epidemiology. These commands either refer to variables in an existing dataset or can take cell counts as arguments (i.e., they are immediate commands; see Chapter 1).

The first cohort dataset in Table 11.1 is given in a file as tabulated and may be read using

infile str5 age num1 py1 num0 py0 using ihd.dat, clear

Here the number of cases and person-years have been named num1 and py1 in the exposed group and num0 and py0 in the unexposed group. We can stack the responses for both groups into variables num and py using the reshape command after generating an identifier agegr for the rows in the data which correspond to age groups.

generate agegr = _n
reshape long num py, i(agegr) j(exposed)

Ignoring agegr, the incidence rate ratio may be estimated using

ir num exposed py

giving the table in Display 11.1. The incidence rate ratio of ischemic heart disease, comparing low energy with high energy intake, is estimated as 2.48 with a 95% confidence interval from 1.29 to 4.78. (Note that we could report the reciprocals of these figures if we wished to consider high energy intake as the risk factor.) The terms (exact) imply that the confidence intervals are exact (no approximation was used).
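The crude rate ratio can also be checked by hand from the stacked data. A minimal sketch, assuming the variables num, py, and exposed created by the reshape step above; the variable name rate is introduced here purely for illustration.

```stata
* Sketch: crude incidence rates and their ratio, ignoring age
preserve
collapse (sum) num py, by(exposed)   // one row per exposure group, sorted by exposed
generate rate = num/py               // cases per person-year
list exposed num py rate
* row 1 is exposed==0 and row 2 is exposed==1 after collapse
display "crude rate ratio = " rate[2]/rate[1]
restore
```

The displayed ratio should agree with the point estimate reported by ir; preserve and restore leave the stacked dataset unchanged for the commands that follow.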

Controlling for age using the epitab command

ir num exposed py, by(age)

(see Display 11.2) gives very similar estimates as shown in the row labeled M-H combined (the Mantel-Haenszel estimate).

Another way of controlling for age is to carry out Poisson regression with the log of py as an offset. The exponentiated offset py may be specified using the exposure(py) option. To obtain exponentiated coefficients, we use the irr ("incidence rate ratio") option.

xi: poisson num exposed i.age, exposure(py) irr

(see Display 11.3) showing that there is an estimated age-adjusted incidence rate ratio of 2.39 with a 95% confidence interval from 1.30 to 4.36. The coefficients of _Iage_2 and _Iage_3 show that the incidence increases with age (although the rate ratios for age groups are not significant at the 5% level) as would be expected.

    i.age             _Iage_1-3           (_Iage_1 for age==40-49 omitted)

    Poisson regression                                Number of obs   =          6
                                                      LR chi2(3)      =      12.91
                                                      Prob > chi2     =     0.0048
    Log likelihood = -11.898228                       Pseudo R2       =     0.3516

             num |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |   2.386096   .7350226     2.82   0.005     1.304609    4.364108
         _Iage_2 |   1.137701   .5408325     0.27   0.786     .4481154    2.888461
         _Iage_3 |   1.997803   .9218379     1.50   0.134     .8086976    4.935362
              py | (exposure)

    Display 11.3

Another way of achieving the same result is using glm with the Poisson distribution and log link with the log of person-years of follow-up specified as an offset using the offset() option

generate lpy = ln(py)
xi: glm num exposed i.age, family(poisson) link(log) ///
    offset(lpy) eform

Here the eform option is used to obtain exponentiated coefficients (incidence rate ratios instead of their logarithms). An advantage of this modeling approach is that we can investigate the possibility of an interaction between exposed and age. If there were more age categories, we could attempt to model the effect of age as a smooth function.

11.3.2 Case-control study

We will analyze the case-control study using the "immediate" command cci. The following notation is used for cci:

                  Exposed   Unexposed
    Cases            a          b
    Noncases         c          d

where the quantities a, b, etc. in the table are specified in alphabetical order, i.e.,

cci a b c d

(See help epitab for the arguments required for other immediate epitab commands.) The bird data may therefore be analyzed as follows:

cci 98 141 101 328

giving the output in Display 11.4. The odds ratio of lung cancer, comparing those with pet birds with those without pet birds, is estimated as 2.26 with an exact 95% confidence interval from 1.58 to 3.22. The p-value for the null hypothesis of no association between pet birds and lung cancer is < 0.001. This p-value is based on a chi-squared test; an exact p-value could be obtained using the exact option.

                                                             Proportion
                     |   Exposed   Unexposed  |      Total     Exposed
    -----------------+------------------------+------------------------
               Cases |        98         141  |        239       0.4100
            Controls |       101         328  |        429       0.2354
    -----------------+------------------------+------------------------
               Total |       199         469  |        668       0.2979
                     |                        |
                     |      Point estimate    |    [95% Conf. Interval]
                     |------------------------+------------------------
          Odds ratio |         2.257145       |    1.580936    3.218758 (exact)
     Attr. frac. ex. |         .5569624       |    .3674633    .6893212 (exact)
     Attr. frac. pop |         .2283779       |
                     +-------------------------------------------------
                                   chi2(1) =    22.37  Pr>chi2 = 0.0000

    Display 11.4

11.3.3 Matched case-control studies

The matched case-control study with one control per case may be analyzed using the immediate command mcci which requires four numbers a to d defined as

                      Controls
    Cases          Exposed   Unexposed
    Exposed           a          b
    Unexposed         c          d

which corresponds to the layout of Table 11.3. The required command therefore is

mcci 12 43 7 121

The results in Display 11.5 suggest that there is an increased odds of endometrial cancer in subjects exposed to oral conjugated estrogens (odds ratio = 6.14, exact 95% confidence interval from 2.74 to 16.18).

                     | Controls               |
    Cases            |   Exposed   Unexposed  |      Total
    -----------------+------------------------+------------
             Exposed |        12          43  |         55
           Unexposed |         7         121  |        128
    -----------------+------------------------+------------
               Total |        19         164  |        183

    McNemar's chi2(1) =     25.92    Prob > chi2 = 0.0000
    Exact McNemar significance probability       = 0.0000

    Proportion with factor
            Cases       .3005464
            Controls    .1038251     [95% Conf. Interval]
                       ---------     --------------------
            difference  .1967213     .1210924   .2723502
            ratio       2.894737     1.885462   4.444269
            rel. diff.  .2195122     .1448549   .2941695

            odds ratio  6.142857     2.739772   16.18458   (exact)

    Display 11.5
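In a 1:1 matched design only the discordant pairs inform the odds ratio and McNemar's test. As a quick check, the two quantities can be computed directly from the off-diagonal counts passed to mcci (43 pairs in which only the case was exposed, 7 in which only the control was exposed); this is a sketch for illustration, not part of the original analysis.

```stata
* Sketch: matched-pairs quantities from the discordant pairs only
display "matched odds ratio = " 43/7
display "McNemar chi2(1)    = " (43 - 7)^2/(43 + 7)
* the immediate command reports the same two quantities
mcci 12 43 7 121
```

The first display gives roughly 6.14 and the second 25.92, which should match the odds ratio and McNemar statistic reported by mcci.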

The matched case-control study with three controls per case cannot be analyzed using epitab. Instead, we will use conditional logistic regression. We need to convert the data in Table 11.4 into the form required for conditional logistic regression; that is, one observation per subject (including cases and controls): an indicator variable, cancer, for cases; another indicator variable, screen, for screening; and a third variable, caseid, an identifier for each case-control set of four women.

First, read the data which are in the form shown in Table 11.4.

infile v1-v4 using screen.dat, clear

Then transpose the data so that the first column contains frequencies for unscreened cases (variable ncases0) and the second for screened cases (variable ncases1). This can be achieved by first defining the string variable _varname to contain the required variable names and then using the xpose command

generate _varname = cond(_n==1,"ncases1","ncases0")
xpose, clear
list

(see Display 11.6).

Display 11.6

The four rows in this transposed dataset correspond to 0, 1, 2, and 3 matched controls who have been screened. We will define a variable nconscr taking on these four values. We can then stack the two columns into a single variable ncases and create an indicator casescr for whether or not the case was screened using the reshape command.

generate nconscr = _n-1
reshape long ncases, i(nconscr) j(casescr)
list

(see Display 11.7). The next step is to replicate each of the records ncases times so that we have one record per case-control set. Then define the variable caseid, and expand the dataset four times in order to have one record per subject. The four subjects within each case-control set are arbitrarily labeled 0 to 3 in the variable control where 0 stands for "the case" and 1, 2, and 3 for the controls.

expand ncases
sort casescr nconscr
generate caseid = _n
expand 4
quietly by caseid, sort: generate control = _n-1
list in 1/8

(see Display 11.8).

    Display 11.7

         nconscr   casescr   ncases   caseid   control
      1.       0         0       11        1         0
      2.       0         0       11        1         1
      3.       0         0       11        1         2
      4.       0         0       11        1         3
      5.       0         0       11        2         0
      6.       0         0       11        2         1
      7.       0         0       11        2         2
      8.       0         0       11        2         3

    Display 11.8

Now screen, the indicator whether a subject was screened, is defined to be 0 except for the cases who were screened and for as many controls as were screened according to nconscr. The variable cancer is 1 for cases and 0 otherwise.

generate screen = 0
replace screen = 1 if control==0&casescr==1 /* the case */
replace screen = 1 if control==1&nconscr>0
replace screen = 1 if control==2&nconscr>1
replace screen = 1 if control==3&nconscr>2
generate cancer = control==0
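Before fitting a model to data that have been expanded in this way, it can be worth confirming that each matched set has the intended structure. The following is a sketch of such checks, assuming the variables caseid, cancer, and screen defined above; the variable ncase is a temporary helper introduced here for illustration.

```stata
* Sketch: consistency checks on the expanded data
by caseid, sort: egen ncase = total(cancer)   // number of cases in each matched set
assert ncase == 1                             // exactly one case per set
by caseid: assert _N == 4                     // one case plus three controls per set
tabulate cancer screen                        // screening by case status, ignoring matching
drop ncase
```

If either assert fails, Stata stops with an error, which points to a problem in the expand or replace steps. The tabulation ignores the matching and is only a descriptive check.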

We can reproduce Table 11.4 by temporarily collapsing the data (using preserve and restore to revert to the original data) as follows:

preserve
collapse (sum) screen (mean) casescr, by(caseid)
generate nconscr = screen - casescr
tabulate casescr nconscr
restore

(see Display 11.9).

               |                 nconscr
       casescr |      0       1       2       3 |   Total
    -----------+--------------------------------+--------
         Total |     12      14      15       5 |      46

    Display 11.9

We are now ready to carry out conditional logistic regression:

clogit cancer screen, group(caseid) or

(see Display 11.10). Screening therefore seems to be protective of death from breast cancer, reducing the odds to about a third (95% confidence interval from 0.13 to 0.69).

    Conditional (fixed-effects) logistic regression   Number of obs   =        184
                                                      LR chi2(1)      =       9.18
                                                      Prob > chi2     =     0.0025
    Log likelihood = -59.181616                       Pseudo R2       =     0.0719

          cancer | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]

    Display 11.10

11.4 Exercises

11.1 Estrogens and endometrial cancer

1. Carry out conditional logistic regression to estimate the odds ratio for the data in Table 11.3. The data are given in the same form as in the table in a file called estrogen.dat.

11.2 Low energy diet and heart disease

1. For the data ihd.dat, use the iri command to estimate the incidence rate ratio of IHD comparing subjects with low versus high energy diets without controlling for age.
2. Use Poisson regression to test whether the effect of exposure to a low energy diet on incidence of IHD differs between age groups.

11.3 Oral contraceptive use and myocardial infarction

Mann et al. (1968) analyzed the data shown in Table 11.5 which are also given in Rothman (1986). The data come from a case-control study to investigate the effect of oral contraceptive use on myocardial infarction. Cases and controls are also classified by age group (< 40 and ≥ 40).

Table 11.5 Number of cases of myocardial infarction and controls who did and did not use oral contraceptives by age group (columns: Younger, Age < 40; Older, Age ≥ 40)

The variables in oral.dta are:

case: dummy variable for case (myocardial infarction) versus noncase
oral: dummy variable for oral contraceptive use
age: age group (0 = Age < 40, 1 = Age ≥ 40)
num: number of women with given values of case, oral, and age

1. Use the cc command to estimate the odds ratio for myocardial infarction comparing those who have and have not used oral contraceptives. Also estimate the age-adjusted odds ratio using the same command.
2. Explain why the adjusted and unadjusted odds ratios are not the same here.
3. Now use logistic regression to estimate the age-adjusted odds ratio.

11.4 Induced abortion and ectopic pregnancy

In a matched case-control study, 18 women with ectopic pregnancies were individually matched according to age, number of pregnancies, and husband's education with four controls. All women had had at least one previous pregnancy, and the exposure of interest is having had at least one induced abortion. The data given in Table 11.6 come from Trichopoulos et al. (see Miettinen, 1969) and were previously analyzed by Rothman (1986).

Table 11.6 Pattern of exposure (to induced abortion) of 4 controls individually matched to exposed and unexposed cases of ectopic pregnancy

                            Number of exposed controls
    Status of the case       0     1     2     3     4
    Exposed                  3     6     3     0     1

The variables in ectopic.dta are:

exposed: dummy variable for case being exposed
numcon: number of matched controls who were exposed
numgroups: number of matched case-control groups with a given status of the case and a given number of matched controls who were exposed

1. Use conditional logistic regression to investigate if there is an association between induced abortion and ectopic pregnancy.
2. Interpret the estimated odds ratio and confidence interval.


Chapter 12

Survival Analysis: Retention of Heroin Addicts in Methadone Maintenance Treatment

12.1 Description of data

The data to be analyzed in this chapter are on 238 heroin addicts in two different clinics receiving methadone maintenance treatment to help them overcome their addiction. Early dropout is an important problem with this treatment. We will therefore analyze the time from admission to termination of treatment (in days), given as time in Table 12.1. For patients still in treatment when these data were collected, time is the time from admission to the time of data collection. The variable status is an indicator for whether time refers to dropout (1) or end of study (0). Possible explanatory variables for retention in treatment are maximum methadone dose and a prison record as well as which of two clinics the addict was treated in. These variables are called dose, prison, and clinic, respectively. The data were first analyzed by Caplehorn and Bell (1991) and also appear in Hand et al. (1994).

Table 12.1 Data in heroin.dat (columns: id, clinic, status, time, prison, dose; the set of six variables is repeated twice in each row)

The data can be described as survival data, although the "endpoint" is not death in this case, but dropout from treatment. From engineering applications, another commonly used term for the endpoint is "failure". Duration or survival data can generally not be analyzed by conventional methods such as linear regression. The main reason for this is that some durations are usually right-censored; that is, the endpoint of interest has not occurred during the period of observation and all that is known about the duration is that it exceeds the observation period. In the present dataset, this applies to all observations where status is 0. Another reason why conventional linear regression would not be appropriate is that survival times tend to have positively skewed distributions. A third reason is that time-varying covariates, such as the time of year, could not be handled. In the next section, we therefore describe methods specifically developed for survival data.

12.2 Survival analysis

12.2.1 Introduction

The survival time $T$ may be regarded as a random variable with a probability distribution $F(t)$ and probability density function $f(t)$. An obvious quantity of interest is the probability of surviving to time $t$ or beyond, the survivor function or survival curve $S(t)$, which is given by

$$S(t) = \Pr(T \ge t) = 1 - F(t). \qquad (12.1)$$

A further function which is of interest for survival data is the hazard function. This represents the instantaneous failure rate, that is, the probability that an individual experiences the event of interest at a time point given that the event has not yet occurred. It can be shown that the hazard function is given by

$$h(t) = \frac{f(t)}{S(t)}, \qquad (12.2)$$

the instantaneous probability of failure at time $t$ divided by the probability of surviving up to time $t$. Note that the hazard function is just the incidence rate discussed in Chapter 11. It follows from equations (12.1) and (12.2) that

$$-\frac{d \log(S(t))}{dt} = h(t),$$

so that

$$S(t) = \exp(-H(t)), \qquad (12.3)$$

where $H(t)$ is the integrated hazard function, also known as the cumulative hazard function.
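The step from equations (12.1) and (12.2) to equation (12.3) can be spelled out as follows. This is a sketch under the usual assumption that $T$ is continuous and positive, so that $S(0) = 1$; the constant-hazard case in the comment is added purely as an illustration.

```latex
\begin{align*}
f(t) &= \frac{dF(t)}{dt} = -\frac{dS(t)}{dt}
  && \text{since } S(t) = 1 - F(t),\\
h(t) &= \frac{f(t)}{S(t)} = -\frac{1}{S(t)}\,\frac{dS(t)}{dt}
      = -\frac{d \log S(t)}{dt},\\
H(t) &= \int_0^t h(u)\,du = -\log S(t) + \log S(0) = -\log S(t)
  && \text{since } S(0) = 1,\\
S(t) &= \exp(-H(t)).
\end{align*}
% Illustration (assumed special case): a constant hazard h(t) = \lambda gives
% H(t) = \lambda t, S(t) = e^{-\lambda t}, and median survival time \log(2)/\lambda.
```

For example, if the hazard were constant at 0.0016 per day, the implied median survival time would be $\log(2)/0.0016 \approx 433$ days.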

The Kaplan-Meier estimator is a nonparametric estimator of the survivor function $S(t)$. If all the failure times, or times at which the event occurs in the sample, are ordered and labeled $t_{(j)}$ such that $t_{(1)} \le t_{(2)} \le \cdots \le t_{(n)}$, the estimator is given by

$$\hat{S}(t) = \prod_{j \mid t_{(j)} \le t} \left( 1 - \frac{d_j}{n_j} \right),$$

where $d_j$ is the number of individuals who experience the event at time $t_{(j)}$, and $n_j$ is the number of individuals who have not yet experienced the event at that time and are therefore still "at risk" of experiencing it (including those censored at $t_{(j)}$). The product is over all failure times less than or equal to $t$.
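A small worked example may help to fix the notation. The five survival times below are hypothetical and are not taken from the heroin data; a plus sign marks a right-censored time.

```latex
% Hypothetical sample: 2, 3+, 5, 7, 8+  (+ = censored)
% Failure times: t_(1) = 2, t_(2) = 5, t_(3) = 7
\begin{array}{cccc}
t_{(j)} & n_j & d_j & 1 - d_j/n_j \\ \hline
2 & 5 & 1 & 4/5 \\
5 & 3 & 1 & 2/3 \\
7 & 2 & 1 & 1/2
\end{array}
\qquad
\hat{S}(t) =
\begin{cases}
1 & 0 \le t < 2,\\
4/5 = 0.80 & 2 \le t < 5,\\
(4/5)(2/3) \approx 0.53 & 5 \le t < 7,\\
(4/5)(2/3)(1/2) \approx 0.27 & 7 \le t \le 8.
\end{cases}
```

The subject censored at time 3 contributes to the risk set at time 2 but not at time 5, which is why $n_j$ drops from 5 to 3 between the first two failure times. The estimate never reaches zero here because the longest time is censored.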

We can compare survival in different subgroups by plotting the Kaplan-Meier estimators of the group-specific survivor functions and applying simple significance tests (such as the log-rank test). However, when there are several explanatory variables, and in particular when some of these are continuous, it is much more useful to use a regression method such as Cox regression. Here the hazard function for individual $i$ is modeled as

$$h_i(t) = h_0(t) \exp(\mathbf{x}_i'\boldsymbol{\beta}), \qquad (12.4)$$

where $h_0(t)$ is the baseline hazard function, $\boldsymbol{\beta}$ are regression coefficients, and $\mathbf{x}_i$ covariates. The baseline hazard is the hazard when all covariates are zero, and this quantity is left unspecified. This nonparametric treatment of the baseline hazard combined with a parametric representation of the effects of covariates gives rise to the term semiparametric model. The main assumption of the model is that the hazard of any individual $i$ is a time-constant multiple of the hazard function of any other individual $j$, the factor being $\exp((\mathbf{x}_i - \mathbf{x}_j)'\boldsymbol{\beta})$, the hazard ratio or incidence rate ratio. This property is called the proportional hazards assumption. The exponentiated regression coefficients can therefore be interpreted as hazard ratios when the corresponding explanatory variables increase by one unit if all other covariates remain constant.

The parameters $\boldsymbol{\beta}$ are estimated by maximizing the partial log likelihood given by

$$\sum_f \log\left( \frac{\exp(\mathbf{x}_f'\boldsymbol{\beta})}{\sum_{i \in r(f)} \exp(\mathbf{x}_i'\boldsymbol{\beta})} \right), \qquad (12.5)$$

where the first summation is over all failures $f$, and the second summation is over all subjects $r(f)$ who are still at risk at the time of failure, the "risk set". It can be shown that this log likelihood is a log profile likelihood (i.e., the log of the likelihood in which the baseline hazard parameters have been replaced by functions of $\boldsymbol{\beta}$ which maximize the likelihood for fixed $\boldsymbol{\beta}$). Note also that the likelihood in equation (12.5) is equivalent to the likelihood for matched case-control studies described in Chapter 11 if the subjects at risk at the time of a failure (the risk set) are regarded as controls matched to the case failing at that point in time (see Clayton and Hills, 1993).

The baseline hazard function may be estimated by maximizing the full log likelihood with the regression parameters evaluated at their estimated values, giving nonzero values only when a failure occurs. Integrating the hazard function gives the cumulative hazard function

$$H_i(t) = H_0(t) \exp(\mathbf{x}_i'\boldsymbol{\beta}), \qquad (12.6)$$

where $H_0(t)$ is the integral of $h_0(t)$. The survival curve may be obtained from $H(t)$ using equation (12.3). This leads to the Kaplan-Meier estimator when there are no covariates.

It follows from equation (12.3) that the survival curve for a Cox model is given by

$$S_i(t) = S_0(t)^{\exp(\mathbf{x}_i'\boldsymbol{\beta})}. \qquad (12.7)$$

The log of the cumulative hazard function predicted by the Cox model is given by

$$\log(H_i(t)) = \log H_0(t) + \mathbf{x}_i'\boldsymbol{\beta}, \qquad (12.8)$$

so that the log cumulative hazard functions of any two subjects $i$ and $j$ are parallel with constant difference given by $(\mathbf{x}_i - \mathbf{x}_j)'\boldsymbol{\beta}$.

Stratified Cox regression can be used to relax the assumption of proportional hazards for a categorical predictor. The partial likelihood of a stratified Cox model has the same form as in equation (12.5) except that the risk set $r(f)$ for each failure is now confined to subjects in the same stratum as the subject contributing to the numerator.

Survival analysis is described in Allison (1984), Clayton and Hills (1993), Collett (2003), and Klein and Moeschberger (2003). Cleves et al. (2004) discuss survival analysis using Stata.

12.3 Analysis using Stata

The data are available as an ASCII file called heroin.dat on the disk accompanying Hand et al. (1994). Since the data are stored in a two-column format with the set of variables repeated twice in each row, as shown in Table 12.1, we have to use reshape to bring the data into the usual form:

infile id1 clinic1 status1 time1 prison1 dose1 ///
    id2 clinic2 status2 time2 prison2 dose2 ///
    using heroin.dat, clear
generate row=_n
reshape long id clinic status time prison dose, ///
    i(row) j(col)
drop row col

Before fitting any survival models, we declare the data as being of the form st (for survival time) using the stset command

stset time, failure(status)

         failure event:  status != 0 & status < .
    obs. time interval:  (0, time]
     exit on or before:  failure

          238  total obs.
            0  exclusions

          238  obs. remaining, representing
          150  failures in single record/single failure data
        95812  total analysis time at risk, at risk from t =         0
                                 earliest observed entry t =         0
                                      last observed exit t =      1076

stsum, by(clinic)

             failure _d:  status
       analysis time _t:  time

             |               incidence       no. of    |------ Survival time -----|
    clinic   | time at risk     rate        subjects        25%       50%       75%
    ---------+---------------------------------------------------------------------
           1 |        59558   .0020484           163        192       428       652
           2 |        36254   .0007723            75        280         .         .
    ---------+---------------------------------------------------------------------
       total |        95812   .0015656           238        212       504       821

There are 238 subjects in total with a median survival time of 504 days. If the incidence rate (i.e., the hazard function) could be assumed to be constant, it would be estimated as 0.0016 per day (which corresponds to 0.57 per year). Overall, 25% of subjects have dropped out by 212 days, but this differs considerably by clinic (192 days in clinic 1 and 280 days in clinic 2). In fact, in clinic 2, less than 50% of people had dropped out by the end of follow-up so that the median survival time and the time until 75% of subjects have dropped out are not given.
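The overall incidence rate quoted above is simply the number of failures divided by the total time at risk, both of which are reported by stset. A quick check, using the totals of 150 failures and 95812 days at risk from the stset output:

```stata
* Sketch: overall dropout rate from the stset totals
display 150/95812            // dropouts per person-day
display 365.25*150/95812     // dropouts per person-year
```

The two results are roughly 0.0016 and 0.57, in line with the figures given in the text.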

The Kaplan-Meier estimators of the survivor functions for the two clinics are obtained and plotted using

sts graph, by(clinic)

giving the graph in Figure 12.1. Dropout seems to occur at a faster rate in clinic 1. According to Caplehorn and Bell (1991), the more rapid decline in the proportion remaining in clinic 1 compared with clinic 2 may be due to the policy of clinic 1 to attempt to limit the duration of maintenance treatment to two years.

Figure 12.1: Kaplan-Meier survival curves. (Plot title: Kaplan-Meier survival estimates, by clinic; horizontal axis: analysis time, 0 to 1000; solid line: clinic = 1; dashed line: clinic = 2.)

To investigate the effects of dose and prison on survival, we will use Cox regression. We will allow the hazard functions for the two clinics to be non-proportional. A Cox regression model with clinics as strata can be estimated using the stcox command with the strata() option:

stcox dose prison, strata(clinic)

giving the output shown in Display 12.1.

             failure _d:  status
       analysis time _t:  time

    Stratified Cox regr. -- Breslow method for ties

    No. of subjects =          238                     Number of obs   =       238
    No. of failures =          150
    Time at risk    =        95812
    Log likelihood  =     -597.714

              _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            dose |   .9654655   .0062418    -5.44   0.000      .953309    .9777769
          prison |   1.475192   .2491827     2.30   0.021     1.059418    2.054138
    -------------------------------------------------------------------------------
                                                              Stratified by clinic

    Display 12.1

Therefore, subjects with a prison history are 47.5% more likely to drop out at any given time (given that they remained in treatment until that time) than those without a prison history. For every increase in methadone dose by one unit (1 mg), the hazard is multiplied by 0.965. This coefficient is very close to one, but this may be because one unit of methadone dose is not a large quantity. Even if we know little about methadone maintenance treatment, we can assess how much one unit of methadone dose is by finding the sample standard deviation:

summarize dose

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            dose |       238    60.39916    14.45013         20        110

indicating that a unit is not much at all; subjects often differ from each other by 10 to 15 units. To find the hazard ratio of two subjects differing by one standard deviation, we need to raise the hazard ratio to the power of one standard deviation, giving $0.9654655^{14.45} = 0.60179167$. We can obtain the same result (with greater precision) using the stored macros _b[dose] for the log hazard ratio and r(Var) for the variance,

display exp(_b[dose]*sqrt(r(Var)))
.60178874

In the above calculation, we simply rescaled the regression coefficient before taking the exponential. To obtain this hazard ratio in the Cox regression, we need to standardize dose to have unit standard deviation. In the command below we also standardize to mean zero, although this will make no difference to the estimated coefficients (only the baseline hazards are affected):

egen zdose = std(dose)

We repeat the Cox regression with zdose instead of dose:

stcox zdose prison, strata(clinic)

(see Display 12.2). The coefficient of zdose is identical to that calculated previously and may now be interpreted as indicating a decrease of the hazard by 40% when the methadone dose increases by one standard deviation.

             failure _d:  status
       analysis time _t:  time

    Stratified Cox regr. -- Breslow method for ties

    No. of subjects =          238                     Number of obs   =       238
    No. of failures =          150
    Time at risk    =        95812
    Log likelihood  =     -597.714                     Prob > chi2     =    0.0000

              _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           zdose |   .6017887   .0562195    -5.44   0.000     .5010998    .7227096
          prison |   1.475192   .2491827     2.30   0.021     1.059418    2.054138
    -------------------------------------------------------------------------------
                                                              Stratified by clinic

    Display 12.2

Assuming the variables prison and zdose satisfy the proportional hazards assumption (see Section 12.3.1), we now present the model graphically. To do this, we will plot the predicted survival curves separately for the two clinics and for those with and without a prison record, where zdose is evaluated at its clinic-specific mean. Such a graph may be produced by using stcox with the bases() option to generate a variable containing the predicted baseline survival function and then applying equation (12.7) to obtain the predicted survival function for particular covariate values:

stcox zdose prison, strata(clinic) bases(s)
egen mdose = mean(zdose), by(clinic)
generate surv = s^exp(_b[zdose]*mdose + _b[prison]*prison)

Note that these survival functions represent predicted values for subjects having the clinic-specific mean dose. We now transform time to time in years, and plot the survival curves separately for each clinic:

generate tt = time/365.25
label variable tt "time in years"
label define clin 1 "Clinic 1" 2 "Clinic 2"
label values clinic clin

sort clinic time
twoway (line surv tt if prison==0, connect(stairstep)) ///
    (line surv tt if prison==1, connect(stairstep) ///
    lpatt(dash)), by(clinic) ylabel(0(0.2)1) ///
    legend(order(1 "No prison record" 2 "Prison record"))

Here the connect(stairstep) option was used to produce the step-shaped survival curves shown in Figure 12.2.

Figure 12.2: Survival curves. (Graphs by clinic; horizontal axis: time in years, 0 to 3; one curve for subjects with a prison record and one for subjects without.)

The partial likelihood is appropriate for continuous survival times, which should theoretically never take on the same value for any two individuals or be tied. However, in practice survival times are measured in discrete units, days in the present example, and ties will frequently occur. For example, subjects 61 and 252 in clinic 1 both dropped out after 180 days. By default, stcox uses Breslow's method for ties, where the risk sets in (12.5) contain all subjects who failed at or after the failure time of the subject contributing to the numerator. For a group of subjects with tied survival times, the contributions to the partial likelihood therefore each have the same denominator. However, risk sets usually decrease by one after each failure. In Efron's method, contributions to the risk set from the subjects with tied failure times are therefore downweighted in successive risk sets. In the exact method (referred to as "exact marginal log likelihood" in [ST] stcox, the Stata reference manual for survival analysis), the contribution to the partial likelihood from a group of tied survival times is the sum, over all possible orderings (or permutations) of the tied survival times, of the contributions to the partial likelihood corresponding to these orderings. Efron's method can be obtained using the efron option and the exact method using the exactm option (see Exercise 12.1).
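As a sketch of how the alternative treatments of ties are requested, the stratified model fitted earlier can be re-estimated with each option in turn. This assumes the data are still stset and that zdose has been created as above.

```stata
* Sketch: refit the stratified Cox model with alternative tie handling
stcox zdose prison, strata(clinic) efron     // Efron's approximation
stcox zdose prison, strata(clinic) exactm    // exact marginal log likelihood
```

Comparing the hazard ratios from these fits with those obtained under the default Breslow method shows how sensitive the estimates are to the handling of ties.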

12.3.1 Assessing the pmportional ham& assumptima

wc urn discuss methods for ~ s e s s i n g thc proportional hw~mds as crnmption. A graphical approach ia itvAabIa for catepricd prcdicto~ if thcrc are sufficient observations for wch d u e oC the predictor. k this case the modcl is first estinlatecl by stsatiCyirig on the c a l c g o r i ~ prdic:tor o l intermt, thus nut making any ass~lrnpt.ion scgmding th relat.ionship bet,wai the bas~linc hnzmds for differcnl valum of tb pr~d~ctor or strata. The log rumulative baseline hazards for the stra:; me then dcrived frnm chc estirn~t.4 model and plotted against time According lo cquatioa (12.8): the rmillt.ing carves shnllld Ilc paralle! 7 t.he proportional hazards assuruptim holds Hcre WQ d~monslrate t b r r ~ t h n d for the variable c l i n i c . The cumulative hawl~nc hazard a; be obtained using stcox with t,hs basechazard0 option as fullms:

quietly stcox zdose prison, strata(clinic) basech(ch)

We now compute and then plot the logarithm of the cumulative baseline hazard function using

generate lh = log(ch)
sort time
twoway (line lh time if clinic==1, connect(stairstep)) ///
    (line lh time if clinic==2, connect(stairstep) ///
    lpatt(dash)), xtitle("Time in days") ///
    ytitle("Log cumulative hazard") ///
    legend(order(1 "Clinic 1" 2 "Clinic 2"))

giving the graph shown in Figure 12.3.

Figure 12.3: Log of minus the log of the survival functions for the two clinics estimated by stratified Cox regression.

Clearly, the curves are not parallel, and we will therefore continue treating the clinics as strata. Note that a quicker way of producing a similar graph would be to use the stphplot command as follows:

stphplot, strata(clinic) adjust(zdose prison) zero ///
    xlabel(1/7)

Here the adjust() option specifies the covariates to be used in the Cox regression, and the zero option specifies that the covariates are to be evaluated at zero. As a result, minus the logs of the cumulative baseline hazard functions (stratified by clinic) are plotted against the log of the survival time, see Figure 12.4.

To determine whether the hazard functions for those with and without a prison history are proportional, we could split the data into four strata by clinic and prison. However, as the strata get smaller, the estimated survival functions become less precise (because the risk sets in equation (12.5) become smaller). Also, a similar method could not be used to check the proportional hazard assumption for the continuous variable zdose without splitting it into arbitrary categories.

Figure 12.4: Minus the log of minus the log of the survival functions for the two clinics versus log survival time, estimated by stratified Cox regression.

Another way of testing the proportional hazards assumption of zdose, say, is to introduce a time-varying covariate equal to the interaction between time (since admission) and zdose, thus allowing the effect of zdose to change over time. To estimate this model, the terms in equation (12.5) need to be evaluated for values of the time-varying covariates at the times of the failure in the numerator. These values are not available for the denominator in the present dataset since each subject is represented only once, at the time of their own failure (and not at all previous failure times). One possibility is to create the required dataset (see below). A simpler option is to simply use the stcox command with the tvc() and texp() options:

stcox zdose prison, strata(clinic) tvc(zdose) ///
    texp((_t-504)/365.25)

The tvc() option specifies the variable that should interact with (a function of) time and the texp() option specifies the function of time to be used. Here we have simply subtracted the median survival time so that the coefficient of zdose can be interpreted as the effect of zdose at the median survival time. We have divided by 365.25 to see by how much the effect of zdose changes between intervals of one year. The output is shown in Display 12.3.

         failure _d:  status
   analysis time _t:  time

Stratified Cox regr. -- Breslow method for ties

No. of subjects =        238                Number of obs   =       238
No. of failures =        150
Time at risk    =      95812
                                            LR chi2(3)      =     34.78
Log likelihood  = -597.99131                Prob > chi2     =    0.0000

          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
t            |
       zdose |   1.147853   .1720109     0.92   0.357     .8557175   1.539722

                                                         Stratified by clinic
Note: Second equation contains variables that continuously vary with respect to
      time; variables are interacted with current values of (_t-504)/365.25.

Display 12.3

The estimated increase in the hazard ratio for zdose is 15% per year. This small effect is not significant at the 5% level, which is confirmed by carrying out the likelihood ratio test as follows:

estimates store model1
quietly stcox zdose prison, strata(clinic)
lrtest model1 .

Likelihood-ratio test                              LR chi2(1)  =      0.85
(Assumption: . nested in model1)                   Prob > chi2 =    0.3579

giving a very similar p-value as before and confirming that there is no evidence that the effect of dose on the hazard varies with time. A similar test can be carried out for prison (see Exercise 12.1).

Although Stata makes it very easy to include an interaction between a variable and a function of time, inclusion of other time-varying covariates, or of more than a single time-varying covariate, requires an expanded version of the current dataset. In the expanded dataset each subject's record should appear (at least) as many times as that subject contributes to a risk set in equation (12.5), with the time variable equal to the corresponding failure times. This can be achieved very easily using the stsplit command, but only after defining an id variable using stset:

stset time, failure(status) id(id)
stsplit, at(failures) strata(clinic)

The stsplit command generates new time and censoring variables _t and _d, respectively. For subject 103, these are listed using

sort id _t
list _t _d if id==103, clean noobs

giving the values shown in Display 12.4.

Display 12.4

The last value of _t (708) is just the value of the original variable time, the subject's survival or censoring time, whereas the previous values are all unique survival times (at which failures occurred) in the same stratum (clinic 2) which are less than the subject's own survival time. These "invented" survival times are times beyond which the subject survives, so the censoring variable _d is set to zero for all invented times and equal to status for the original survival time. This new survival dataset is equivalent to the original one, and we obtain the same results as before if we run

stcox zdose prison, strata(clinic)

(output not shown). However, we can now create time-varying covariates making use of the new time variable _t. To assess the proportional hazards assumption for zdose, we generate an interaction between zdose and the linear transformation of _t we used previously:

generate tdose = zdose*(_t-504)/365.25

We now fit the Cox regression, allowing the effect of zdose to vary with time:

stcox zdose tdose prison, strata(clinic)

giving the same result as before in Display 12.5.
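Because tdose is zdose multiplied by the number of years since the median survival time, the hazard ratio for zdose at other time points follows from linear combinations of the two coefficients. A short sketch using lincom after the model just fitted is given below; the choice of one year either side of the median is purely illustrative.

```stata
* Hazard ratio for zdose one year after the median survival time
lincom zdose + tdose, hr
* Hazard ratio for zdose one year before the median survival time
lincom zdose - tdose, hr
```

These commands must be run while tdose is still in the dataset and the model including it is the most recent estimation result.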

We can restore the original data using the stjoin command after deleting any time-varying covariates (apart from _t and _d):

drop tdose
stjoin

         failure _d:  status
   analysis time _t:  time
                 id:  id

Stratified Cox regr. -- Breslow method for ties

No. of subjects =        238                Number of obs   =     11228
No. of failures =        150
Time at risk    =      95812
                                            LR chi2(3)      =     39.94
Log likelihood  =   -597.724                Prob > chi2     =    0.0000

          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
       zdose |   .6017887   .0562186    -5.44   0.000        .5011   .7227097
      prison |   1.475132   .2491827     2.30   0.021       1.0593   2.054138

                                                         Stratified by clinic

Display 12.5

A test of proportional hazards based on rescaled Schoenfeld or efficient score residuals (see below), suggested by Grambsch and Therneau (1994), is also available using the estat phtest command (see for example Cleves et al., 2004).
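A hedged sketch of how this test might be run for the present model follows. It assumes the Stata 9 syntax, in which the Schoenfeld and scaled Schoenfeld residuals must be saved when the model is fitted; later releases compute them on demand, so the two residual options may be unnecessary or unavailable there. It also assumes that the test is permitted after a stratified model in the installed release. The stub names sch and sca are arbitrary.

```stata
* Save the residuals needed by the test when fitting the model (Stata 9 syntax)
quietly stcox zdose prison, strata(clinic) schoenfeld(sch*) scaledsch(sca*)
* Covariate-specific and global tests of proportional hazards
estat phtest, detail
* Remove the residual variables once the test has been carried out
drop sch* sca*
```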


It is a good idea to produce some residual plots, for example a graph of the deviance residuals against the linear predictor. In order for predict to be able to compute the deviance residuals, we must first store the martingale residuals (see for example Collett, 2003) using stcox with the mgale() option:

stcox zdose prison, strata(clinic) mgale(mart)
predict devr, deviance
predict xb, xb
twoway scatter devr xb, mlabel(id) mlabpos(0) ///
    msymbol(none)

with the result shown in Figure 12.5. There appear to be no serious outliers.

Figure 12.5: Deviance residuals for survival analysis.
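As a complement to the plot, subjects with large deviance residuals can also be listed directly. The following is a small sketch only; the cutoff of 2 in absolute value is an arbitrary choice made for illustration, not a rule from the text.

```stata
* List subjects whose deviance residual exceeds 2 in absolute value
list id clinic zdose prison devr if abs(devr) > 2 & !missing(devr), clean noobs
```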

Another type of residual is the Schoenfeld or efficient score residual, defined as the first derivative of the partial log likelihood function with respect to an explanatory variable. The score residual is large in absolute value if a case's explanatory variable differs substantially from the explanatory variables of subjects whose estimated risk of failure is large at the case's time of failure or censoring. Since our model has two explanatory variables, we can compute the efficient score residuals for zdose and prison and store them in score1 and score2 using stcox with the esr option:

stcox zdose prison, strata(clinic) esr(score*)

These residuals can be plotted against survival time using

twoway scatter score1 tt, mlabel(id) mlabpos(0) msymbol(none)

and similarly for score2. The resulting graphs are shown in Figures 12.6 and 12.7. Subject 89 has a low value of zdose (-1.75) compared with other subjects at risk of failure at such a late time. Subjects 8, 27, 12, and 71 drop out relatively late considering that they have a police record, whereas others remaining beyond their time of dropout (or censoring) tend to have no police record.

Figure 12.6: Score residuals for zdose.

Figure 12.7: Score residuals for prison.

12.4 Exercises

12.1 Retention of heroin addicts in methadone maintenance treatment

1. In the original analysis of this data, Caplehorn and Bell (1991) judged that the hazards were approximately proportional for the first 450 days (see Figure 12.3). They therefore analyzed the data for this time period using clinic as a covariate instead of stratifying by clinic. Repeat this analysis, using prison and dose as further covariates.

2. Following Caplehorn and Bell (1991), repeat the above analysis treating dose as a categorical variable with three levels (< 60, 60 - 79, ≥ 80), and plot the predicted survival curves for the three dose categories when prison and clinic take on one of their values.

3. Test for an interaction between clinic and methadone dose, treating dose as both continuous and categorical.

4. For the model treating dose as categorical and containing no interaction, compare the estimates using three different methods of handling ties: the Breslow, Efron, and exact methods.

5. Check the proportional hazards assumption for prison using a graphical method.

12.2 Survival of patients with primary biliary cirrhosis

The data to be analyzed here come from a Mayo Clinic trial conducted between 1974 and 1984 and accompany the book by Therneau and Grambsch (2000). They were previously analyzed by Dickson et al. (1989) and Fleming and Harrington (1991). Patients with primary biliary cirrhosis were randomized to receive either D-penicillamine or placebo. The treatment was found not to be effective in prolonging survival, and the data, supplemented with an additional 106 patients not participating in the trial, have been used to develop a model for survival in a "natural history setting".

The variables in pbc.dta that will be used here are:

• id: subject identifier
• futime: number of days between registration and the earlier of liver transplant, death, or end of follow-up
• status: status (0=alive, 1=liver transplant, 2=dead)
• age: age in days
• edema: presence of edema (0=no edema and no therapy for edema, 0.5=edema either not treated or resolved by treatment, 1=edema despite treatment)
• bilir: serum bilirubin concentration (mg/dl)
• prothr: prothrombin time (seconds)
• album: albumin concentration (mg/dl)

1. Form the natural logarithms of bilir, prothr, and album and convert age to be in years.

2. Fit a Cox regression model with the above transformed variables and the categorical variable edema as covariates. Treat liver transplantation as censoring.

3. Interpret the estimated hazard ratios.

4. Relax the proportional hazards assumption for age using an interaction with analysis time and use a likelihood ratio test to assess the assumption.

5. Perform an analogous test for edema. Also use a graphical method for assessing the proportional hazards assumption for edema.

6. Produce and plot efficient score residuals for all covariates in the model.


12.3 Duration of UN peacekeeping missions

Box-Steffensmeier and Jones (2004) provide a dataset on the duration of UN peacekeeping missions between 1948 and 2001. The data were originally analyzed by Green et al. (1998).

The variables in un.dta used for this exercise are:

• duration: duration of peacekeeping mission until the earlier of the completion date or 2001
• complete: dummy variable for peacekeeping mission being completed
• contype: the type of conflict that led to the peacekeeping mission (1=civil war, 2=interstate conflict, 3=internationalized civil war)

1. Produce Kaplan-Meier survival curves by type of conflict.

2. Fit a Cox proportional hazards model to investigate the effect of type of conflict on the duration of UN peacekeeping missions. Use the exact method for handling ties. Interpret the estimates.

3. Plot the model-implied survival curves for the three types of conflict and compare them with the Kaplan-Meier curves.

4. Test the null hypothesis that type of conflict does not affect the duration of peacekeeping missions using a Wald test based on the estimates from the Cox regression and using a log-rank test (see help sts test) and compare the results.

12.4 Treatment of prostate cancer

Here we consider data from a clinical trial for the treatment of prostate cancer that were previously analyzed by Collett (2003) and Everitt and Pickles (2004). (This is a subset of data analyzed by Andrews and Herzberg, 1985.) Patients were randomized to 1 mg per day of diethylstilbestrol (DES) or placebo and their survival was recorded in months.

The variables in survprost.dta are:

• time: time from the start of the trial to death or censoring (in months)
• status: dummy variable for death (versus censoring)
• treatment: dummy variable for DES treatment (versus placebo)
• age: age at the start of the trial (in years)
• haem: serum haemoglobin level in gm/100ml
• gleason: a combined index of tumor stage and grade (a larger index indicates a more advanced tumor)

1. Fit a Cox regression model to estimate the unadjusted hazard ratio for DES treatment versus placebo. Interpret the estimated treatment effect.

2. Use a forward selection procedure to select additional covariates, in addition to treatment, among the other variables listed above. Note that you should force treatment to be in the model (see help stepwise).

3. Interpret the estimates.

4. For the selected model, plot the model-implied survival curves by treatment group, evaluating the covariates at their mean.


Chapter 13

Maximum Likelihood Estimation: Age of Onset of Schizophrenia

13.1 Description of data

Table 13.1 gives the ages of onset of schizophrenia (determined as age on first admission) for 99 women. These data will be used to investigate whether there is any evidence for the subtype model of schizophrenia (see Lewine, 1981), according to which there are two types of schizophrenia characterized by early and late onset.

13.2 Finite mixture distributions

The most common type of finite mixture distribution for continuous responses is a mixture of univariate normal distributions of the form

$$f(y_i; \mathbf{p}, \boldsymbol{\mu}, \boldsymbol{\sigma}) = p_1\, g(y_i; \mu_1, \sigma_1) + \cdots + p_k\, g(y_i; \mu_k, \sigma_k),$$

where $g(y; \mu, \sigma)$ is the normal or Gaussian density with mean $\mu$ and standard deviation $\sigma$,

$$g(y; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^{2}\right\}, \qquad (13.1)$$

and $p_1, \ldots, p_k$ are mixing probabilities with $\sum_{j=1}^{k} p_j = 1$.

The parameters $\mathbf{p}' = (p_1, \ldots, p_k)$, $\boldsymbol{\mu}' = (\mu_1, \ldots, \mu_k)$, and $\boldsymbol{\sigma}' = (\sigma_1, \ldots, \sigma_k)$ are usually estimated by maximum likelihood. Standard errors can be obtained in the usual way from the observed information matrix (i.e., from the inverse of the Hessian matrix, the matrix of second derivatives of the log likelihood). Determining the number of components $k$ in the mixture is more problematic since the conventional likelihood ratio test cannot be used to compare models with different $k$, a point we will return to later.

Table 13.1 Age of onset of schizophrenia data in onset.dat

For a short introduction to finite mixture modeling, see Everitt (1996); a more comprehensive account is given in McLachlan and Peel (2000). Maximum likelihood estimation using Stata is described in detail by Gould et al. (2006).

13.3 Analysis using Stata

Stata has a command called ml, which can be used to maximize a user-specified log likelihood using the Newton-Raphson algorithm. The algorithm is iterative. Starting with initial parameter values, the program evaluates the first and second derivatives of the log likelihood at the parameter values to find a new set of parameter values where the likelihood is likely to be greater. The derivatives are then evaluated for the new parameters to update the parameters again, etc., until the maximum has been found (where the first derivatives are zero and the second derivatives negative). The EM algorithm, an alternative to Newton-Raphson, is often believed to be superior for finite mixture models. However, in our experience the implementation of Newton-Raphson in Stata works very well for these models.
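To make the iteration concrete, here is a minimal sketch of a single Newton-Raphson update carried out by hand for the simplest possible case, the mean of a normal distribution with standard deviation fixed at 1. It is an illustration only and not how ml is implemented; the scalar names mu, grad, and hess, the variable g, and the starting value 0 are arbitrary choices.

```stata
* Simulate 100 observations from a normal distribution with mean 5 and sd 1
clear
set obs 100
set seed 12345678
generate y = invnormal(uniform()) + 5

* Starting value for the mean
scalar mu = 0

* First derivative of the log likelihood: sum of (y - mu)
quietly generate double g = y - mu
quietly summarize g
scalar grad = r(sum)

* Second derivative of the log likelihood: minus the sample size
scalar hess = -r(N)

* Newton-Raphson update: new value = old value - gradient/Hessian
scalar mu = mu - grad/hess
display mu
```

Because this log likelihood is exactly quadratic in the mean, the single update lands on the sample mean, which can be confirmed with summarize y. For a mixture model the log likelihood is not quadratic and several updates are needed.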

To use ml, the user must write a program that evaluates the log likelihood and possibly its derivatives. The ml command provides four methods: d0, d1, d2, and lf. The d0 method does not require the user's program to evaluate any derivatives of the log likelihood, d1 requires first derivatives only, and d2 requires both first and second derivatives. When d0 is chosen, first and second derivatives are found numerically; this makes this alternative slower and less accurate than d1 or d2.

The simplest approach to use is lf which also does not require any derivatives to be programmed. Instead, the structure of most likelihood problems is used to increase both the speed and the accuracy of the numerical differentiation. Whereas d0 to d2 can be used for any maximum likelihood problem, method lf can only be used if the likelihood satisfies the following two criteria:

1. The units of observations in the dataset are independent, i.e., the log likelihood is the sum of the log likelihood contributions of the observations.

2. The log likelihood contributions have a linear form, i.e., they are (not necessarily linear) functions of linear predictors of the form $\eta_i = x_{1i}\beta_1 + \cdots + x_{ki}\beta_k$.

The first restriction is usually met, an exception being longitudinal or multilevel data. The second restriction is not as severe as it appears because there may be several linear predictors, as we shall see later. These restrictions allow lf to evaluate derivatives efficiently and accurately using the chain rule. For example, first derivatives are obtained as

$$\frac{\partial l_i}{\partial \beta_1} = \frac{\partial l_i}{\partial \eta_i}\, x_{1i},$$

where $l_i$ is the log likelihood contribution from the $i$th observation and $\eta_i$ is the linear predictor. All that is required are the derivatives with respect to the linear predictor(s) from which the derivatives with respect to the individual parameters follow automatically.

In this chapter, we will give only a brief introduction to maximum likelihood estimation using Stata, restricting our examples to the lf method. We recommend the book on Maximum Likelihood Estimation with Stata by Gould et al. (2006) for a thorough treatment of the topic.

We will eventually fit a mixture of normals to the age of onset data, but will introduce the ml procedure by a series of simpler models.


13.3.1 Single normal density

To begin with, we will fit a normal distribution with standard deviation fixed at 1 to a set of simulated data. A (pseudo)-random sample from the normal distribution with mean 5 and standard deviation 1 can be obtained using the instructions

clear
set obs 100
set seed 12345678
generate y = invnormal(uniform()) + 5

where the purpose of the set seed command is simply to ensure that the same data will be generated each time we repeat this sequence of commands. We use summarize to confirm that the sample has a mean close to 5 and a standard deviation close to 1.

summarize y

    Variable |       Obs        Mean    Std. Dev.       Min        Max
           y |       100    5.002311    1.053095   2.112869   7.351898

First, we will define a program, mixing0, to evaluate the log likelihood contributions when called from ml. The program must have two arguments; the first is the variable name where ml will look for the computed log likelihood contributions; the second is the variable name containing the "current" value of the linear predictor during the iterative maximization procedure:

capture program drop mixing0
program mixing0
    version 9.2
    args lj xb

    tempname s
    scalar `s' = 1
    quietly replace `lj' = ln(normalden($ML_y1,`xb',`s'))
end

After giving names lj and xb to the arguments passed to mixing0 by ml, mixing0 defines a temporary name stored in the local macro s and subsequently defines a scalar having that name and taking the value 1. This scalar represents the standard deviation used in the next command. Using temporary names avoids any confusion with variables that may exist in the dataset. The final command returns the log of the normal density in `lj' as required by the calling program. Here we used the normalden(y, μ, σ) function to calculate the normal density in equation (13.1) where y is the dependent variable whose name is stored in the global macro ML_y1 by ml. The mean μ is just the linear predictor `xb', and the standard deviation σ is the scalar `s'.

Note that all variables defined within a program to be used with ml should be of storage type double to enable ml to estimate accurate numerical derivatives. Scalars should be used instead of local macros to hold constants as scalars have higher precision.

The commands above can be typed into a do-file. After running the commands, define the model using the command

ml model lf mixing0 (xb: y=)

which specifies the method as lf and the program to evaluate the log likelihood contributions as mixing0. The response and explanatory variables are given by the "equation" in parentheses. Here the name before the colon, xb, is the name of the equation, the variable after the colon, y, is the response variable, and the variables after the "=" are the explanatory variables contributing to the linear predictor. No explanatory variables are given here, so a constant-only model will be fitted.

As a result of this model definition, the global ML_y1 will be equal to y and in mixing0 `xb' will be equal to the intercept parameter (the mean) that is going to be estimated by ml.
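Before maximizing, it can be worth letting ml test the evaluator program. The sketch below assumes that mixing0 has been defined as above and that y is in memory; ml check and ml search are optional steps that are not part of the sequence used in this chapter.

```stata
* Declare the model, then let ml test the likelihood evaluator
ml model lf mixing0 (xb: y=)
ml check

* Optionally let ml look for better starting values before maximizing
ml search
```

If ml check reports a problem, the error usually lies in the evaluator program (for example a mistyped macro name) rather than in the data.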

Now we maximize the log likelihood using the command

ml maximize, noheader

giving the results shown in Display 13.1.

initial:       log likelihood = -1397.9454
alternative:   log likelihood = -1160.3298
rescale:       log likelihood = -197.02113
Iteration 0:   log likelihood = -197.02113
Iteration 1:   log likelihood = -146.78981
Iteration 2:   log likelihood = -146.78981
Iteration 3:   log likelihood = -146.78981

           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _cons |   5.002311         .1    50.02   0.000     4.806314   5.198307

Display 13.1

The program converged in three iterations and the maximum likelihood estimate of the mean is equal to the sample mean of 5.002311. If we were interested in observing the value of the mean parameter in each iteration, we could use the trace option in the ml maximize command. We used the noheader option to suppress output relating to the likelihood ratio test against a "null" model since we have not specified such a model.

We will now extend the program step by step until it can be used to estimate a mixture of two Gaussians. The first step is to allow the standard deviation to be estimated. Since this parameter does not contribute linearly to the linear predictor used to estimate the mean, we must define another linear predictor by specifying another equation with no dependent variable in the ml model command. Assuming that the program to evaluate the log likelihood contributions is called mixing1, the ml model command becomes:

ml model lf mixing1 (xb: y=) (lsd:)

The new equation has the name lsd, has no dependent variable (since y is the only dependent variable), and the linear predictor is simply a constant. A short-form of the above command is

ml model lf mixing1 (xb: y=) /lsd

We intend to use lsd as the log standard deviation. Estimating the log of the standard deviation will ensure that the standard deviation itself is positive. We will now modify the function mixing0 so that it has an additional argument for the log standard deviation:

capture program drop mixing1
program mixing1
    version 9.2
    args lj xb ls

    tempvar s
    quietly generate double `s' = exp(`ls')
    quietly replace `lj' = ln(normalden($ML_y1,`xb',`s'))
end

We now define a temporary variable s instead of a scalar because the linear predictor `ls' is a variable. This is because the linear predictor is defined in the ml model command and could in principle contain covariates (see Exercise 13.2) and hence differ between units. The temporary variable name will not clash with any existing variable names, and the variable will automatically be deleted when the program has finished running. Running

ml maximize, noheader

gives the output shown in Display 13.2.

initial:       log likelihood = -1397.9454
alternative:   log likelihood = -634.94948
rescale:       log likelihood = -301.15405
rescale eq:    log likelihood = -180.56802
Iteration 0:   log likelihood = -180.56802
Iteration 1:   log likelihood = -147.59179
Iteration 2:   log likelihood = -146.58986
Iteration 3:   log likelihood =  -146.5647
Iteration 4:   log likelihood = -146.56469

           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
xb           |
       _cons |   5.002311   .1047816    47.74   0.000     4.796943   5.207679
lsd          |
       _cons |   .0467083   .0707107     0.66   0.509    -.0918821   .1852986

Display 13.2

The standard deviation estimate is obtained by exponentiating the estimated coefficient of _cons in equation lsd. Instead of typing display exp(0.0467083), we can use the following syntax for accessing coefficients and their standard errors:

display [lsd]_b[_cons]
.04670833

display [lsd]_se[_cons]
.07071068

We can also omit "_b" from the first expression, and compute the required standard deviation using

display exp([lsd][_cons])
1.0478163

This is smaller than the sample standard deviation from summarize because the maximum likelihood estimate of the standard deviation is given by

$$\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad (13.2)$$

where $n$ is the sample size, whereas the factor $1/(n-1)$ is used in summarize. Since $n$ is 100 in this case, the maximum likelihood estimate must be "blown up" by a factor of $\sqrt{100/99}$ to obtain the sample standard deviation:

display exp([lsd][_cons])*sqrt(100/99)
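The same relationship can be checked in the other direction, starting from the results stored by summarize. This is a small sketch that assumes the ml results for the model with equation lsd are still the most recent estimation results.

```stata
* Sample standard deviation (divisor n-1) rescaled to the ML estimate (divisor n)
quietly summarize y
display r(sd)*sqrt((r(N)-1)/r(N))

* Estimate from ml, transformed back from the log scale
display exp([lsd]_b[_cons])
```

The two displayed values should agree up to the convergence tolerance of ml.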


13.3.2 Two-component mixture model

The program can now be extended to estimate a mixture of two Gaussians. To allow us to test the program on data from a known distribution, we will simulate a sample from a mixture of two Gaussians with standard deviations equal to 1 and means equal to 0 and 5 and with mixing probabilities p1 = p2 = 0.5. This can be done in two stages; first randomly allocate observations to groups (variable z) with probabilities p1 and p2, and then sample from the different component densities according to group membership.

clear
set obs 300
set seed 12345678
generate z = cond(uniform()<0.5,1,2)
generate y = invnormal(uniform())
replace y = y + 5 if z==2

We now need five linear predictors, one for each parameter to be estimated: μ1, μ2, σ1, σ2, and p1 (since p2 = 1 − p1). As before, we can ensure that the estimated standard deviations are positive by taking the exponential inside the program. The mixing proportion p1 must lie in the range 0 ≤ p1 ≤ 1. One way of ensuring this is to interpret the linear predictor as representing the log odds (see Chapter 6) so that p1 is obtained from the linear predictor of the log odds, lo1, using the transformation 1/(1+exp(-lo1)). The program now becomes

capture program drop mixing2
program mixing2
    version 9.2
    args lj xb1 xb2 lo1 ls1 ls2

    tempvar f1 f2 p s1 s2
    quietly {
        generate double `s1' = exp(`ls1')
        generate double `s2' = exp(`ls2')
        generate double `p' = 1/(1+exp(-`lo1'))

        generate double `f1' = normalden($ML_y1,`xb1',`s1')
        generate double `f2' = normalden($ML_y1,`xb2',`s2')
        replace `lj' = ln(`p'*`f1' + (1-`p')*`f2')
    }
end


Here we have applied quietly to a whole block of commands by enclosing them in braces.

Stata simply uses zeros as starting or initial values for all parameters. However, it is not advisable here to start with the same initial value for both component means. Therefore the starting values should be set using the ml init commands as follows:

ml model lf mixing2 (xb1: y=) /xb2 /lo1 /lsd1 /lsd2
ml init 1 6 0 0.2 -0.2, copy
ml maximize, noheader

In the ml init command, the first two values are initial values for the means, the third for the log odds, and the fourth and fifth for the logs of the standard deviations.
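Since the last three starting values are given on transformed scales, it may help to see what they correspond to on the natural scale. The following lines are a small illustration and are not needed for the estimation itself.

```stata
* Starting values implied by: ml init 1 6 0 0.2 -0.2, copy
* Mixing probability p1 implied by log odds of 0
display invlogit(0)
* Standard deviation of component 1 implied by a log standard deviation of 0.2
display exp(0.2)
* Standard deviation of component 2 implied by a log standard deviation of -0.2
display exp(-0.2)
```

The starting values therefore correspond to equal mixing probabilities and to standard deviations of roughly 1.22 and 0.82.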

Thc results are shown in Display 13.3 where tbo standard deviations

log lik~llhood - -746.68918 log llkelkhooa - -746.68918 log lakeliheod -676.61784 log likel&bood - -676 617E4 (not < log 11kelihod - -630.71006 log lilrsllhoad = -825.89 log likelihood - 629.23409 log likel%hood - -622.21163 1% llkel~hood = -622.21162

-- y 1 Coaf. ~ t d . Wr. z P>IZ, t957, ~onf. Interval]

Display 13.3

    display exp([lsd1][_cons])
    .9729058

and

    display exp([lsd2][_cons])
    .9828052

and the probability of membership in group 1 is estimated as 0.528.

The maximum likelihood estimates agree quite closely with the true parameter values.

Alternative estimates of the mixing probability and means and standard deviations, treating group membership as known (usually not possible!), are obtained using

    table z, contents(freq mean y sd y)

-. 02067a6 1. P O X 5 4 5.012208 .854237

and these are also similar to the maximum likelihood estimates. The maximum likelihood estimate of the proportion (0.528) is closer to the realized proportion 160/300 = 0.533 than the "true" proportion 0.5.

The standard errors of the estimated means are given in the regression table in Display 13.3 (0.081 and 0.086). We can estimate the standard errors of the standard deviations and of the probability from the standard errors of the log standard deviations and log odds using the delta method (see, for example, Agresti, 2002, pages 577-581). According to the delta method, if $y = f(x)$, then approximately $\mathrm{se}(y) = |f'(x)|\,\mathrm{se}(x)$, where $f'(x)$ is the first derivative of $f(x)$ with respect to $x$ evaluated at the estimated value of $x$. For the standard deviation, $\mathrm{sd} = \exp(\mathrm{lsd})$, so that, by the delta method,

$$\mathrm{se}(\mathrm{sd}) = \mathrm{sd} \times \mathrm{se}(\mathrm{lsd}).$$

For the probability, $p = 1/(1 + \exp(-\mathrm{lo}))$, so that

$$\mathrm{se}(p) = p(1 - p) \times \mathrm{se}(\mathrm{lo}).$$

However, an even easier way of obtaining and displaying a function of coefficients with the correct standard error in Stata is using the nlcom command:

    nlcom (sd1: exp([lsd1][_cons])) (sd2: exp([lsd2][_cons])) ///
        (p: invlogit([lo1][_cons]))

             sd1:  exp([lsd1][_cons])
             sd2:  exp([lsd2][_cons])
               p:  invlogit([lo1][_cons])

               y |     Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
             sd1 |  .9729058   .0610799    15.93   0.000    .8531913     1.09262
             sd2 |  .9828052   .0611851    16.06   0.000    .8628846    1.102726
               p |  .5280309   .0292087    18.08   0.000    .4707829     .585279

Display 13.4

In each set of parentheses, we specify a label, followed by a colon and then an expression defining the function of stored estimates we are interested in. Here we used the invlogit() function to obtain the probability from the log odds. Stata then uses numerical derivatives to work out the correct standard error using the delta method, giving the results shown in Display 13.4. The z-statistic, p-value, and confidence interval should be ignored unless it is reasonable to assume a normal sampling distribution for the derived parameter.
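The agreement between the hand-computed delta-method formula and nlcom can be checked on any fitted model. The following is a minimal sketch only: it assumes the auto dataset shipped with Stata and an intercept-only logit as a stand-in for the mixture model, and the scalar names p_hat, se_hand, and se_nlcom are made up for this illustration.

```stata
* Sketch: delta-method standard error by hand versus nlcom.
* The auto data and the intercept-only logit are stand-ins chosen
* only so that the example runs on its own.
sysuse auto, clear
quietly logit foreign

* By hand: p = invlogit(lo), so se(p) = p*(1-p)*se(lo)
scalar p_hat = invlogit(_b[_cons])
scalar se_hand = p_hat*(1 - p_hat)*_se[_cons]

* The same quantity from nlcom, which differentiates numerically
nlcom (p: invlogit(_b[_cons]))
matrix V = r(V)
scalar se_nlcom = sqrt(V[1,1])

display "by hand: " se_hand "   nlcom: " se_nlcom
```

The two standard errors should agree up to the accuracy of the numerical derivative; the same pattern applies to exp() of a log standard deviation.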

We now apply the same program to the age of onset data. The data can be read in using

    infile y using onset.dat, clear
    label variable y "age of onset of schizophrenia"

A useful graphical display of the data is a histogram produced using

    histogram y, bin(12)

which is shown in Figure 13.1. It seems reasonable to use initial values of 20 and 45 for the two means. In addition, we will use a mixing proportion of 0.5 and log standard deviations of 2 as initial values.

    ml model lf mixing2 (xb1: y = ) /xb2 /lo1 /lsd1 /lsd2
    ml init 20 45 0 2 2, copy
    ml maximize, noheader
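Because the program works with the log odds and the log standard deviations, starting values have to be supplied on those transformed scales. A small sketch of the conversion in both directions, using only built-in functions (the particular numbers are the ones quoted in the text):

```stata
* Sketch: what the starting values 0 (log odds) and 2 (log sd) imply
* on the natural scales
scalar p_start  = invlogit(0)
scalar sd_start = exp(2)
display "mixing proportion: " p_start "   standard deviation: " sd_start

* Going the other way, from natural-scale guesses to starting values
scalar lo_start  = logit(0.5)
scalar lsd_start = ln(sd_start)
display "log odds: " lo_start "   log sd: " lsd_start
```

A log standard deviation of 2 therefore corresponds to a starting standard deviation of a little over 7 years.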

The output is given in Display 13.5. The means are estimated as 24.8 and 46.4, the standard deviations as 6.5 and 7.1, and the mixing proportions as 0.74 and 0.26, for groups 1 and 2, respectively. The approximate standard errors may be obtained as before; see Exercise 13.1.

We will now plot the estimated mixture density together with a kernel density estimate. Instead of using the command kdensity, we will use twoway kdensity, allowing us to add the mixture density onto the same graph:

    initial:       log likelihood = -391.61146
    rescale:       log likelihood = -391.61146
    rescale eq:    log likelihood = -391.61146
    Iteration 0:   log likelihood = -391.61146  (not concave)
    Iteration 1:   log likelihood = -374.68715  (not concave)
    Iteration 2:   log likelihood = -374.36498
    Iteration 3:   log likelihood = -373.94157
    Iteration 4:   log likelihood = -373.67092
    Iteration 5:   log likelihood = -373.66896
    Iteration 6:   log likelihood = -373.66896

Display 13.5

[Histogram; horizontal axis: age of onset of schizophrenia]

    graph twoway (kdensity y, width(3)) ///
        (function invlogit([lo1]_cons) ///
        *normalden(x, [xb1]_cons, exp([lsd1]_cons)) ///
        +(1-invlogit([lo1]_cons)) ///
        *normalden(x, [xb2]_cons, exp([lsd2]_cons)), ///
        range(y) lpatt(dash)), ///
        xtitle("Age") ytitle("Density") ///
        legend(order(1 "Kernel density" 2 "Mixture model"))

In Figure 13.2 the two estimates of the density are surprisingly similar. (Admittedly, we did specify width(3) for the half-width of the kernel because this gave a good fit!)

Figure 13.2: Kernel and mixture model densities for the age of onset data.

The histogram and kernel density estimate suggest that there are two subpopulations. To test this more formally, we could also fit a single normal distribution and compare the likelihoods. However, as mentioned earlier, the conventional likelihood ratio test is not valid here. Wolfe (1971) suggests, on the basis of a limited simulation study, that the difference in minus twice the log likelihood for a model with $k$ components compared with a model with $k + 1$ components has approximately a $\chi^2$ distribution with $2\nu - 2$ degrees of freedom; here $\nu$ is the number of extra parameters in the $k + 1$ component mixture. The log likelihood of the current model may be accessed using e(ll). We store this in a local macro

    local ll = e(ll)

and fit the single normal model using the program mixing1 as follows:

    ml model lf mixing1 (xb: y = ) /lsd
    ml init 30 1.9, copy
    ml maximize, noheader

with the result shown in Display 13.6. Comparing the log likelihoods using the method proposed by Wolfe,

    local chi2 = 2*(`ll'-e(ll))
    display chi2tail(4,`chi2')

    .00063994

confirms that there appear to be two subpopulations.
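The 4 degrees of freedom passed to chi2tail() follow from Wolfe's rule. As a worked statement, writing $\ell_1$ and $\ell_2$ for the maximized log likelihoods of the one- and two-component models (notation introduced here for illustration only): the two-component model has five parameters and the single normal has two, so $\nu = 3$ (one extra mean, one extra standard deviation, and the mixing proportion).

```latex
\[
  -2(\ell_1 - \ell_2) \;=\; 2(\ell_2 - \ell_1) \;\overset{\text{approx.}}{\sim}\; \chi^2_{2\nu - 2},
  \qquad \nu = 5 - 2 = 3 \;\Rightarrow\; 2\nu - 2 = 4 .
\]
```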


~hitial: log l i k e l ~ h c c d - -428.14924 rescale: log likelihood = -429.14924 rescale eq: log likelihood - -429.14994 I tsrat~an 0: log likelihooa - -428.14924 Iteratloll 1 : log I ike l~hood = -383.94844 Iteration 2: log likelihood * -3%3.4

-

Y Cosf . St&. kz. z P> l a l IPS% Coal. Interval]

xb -cons 30.47475 1.169045 26.07 0.000 28.15345 32.76503

- - - - -

Display 13.6

13.4 Exercises

13.1 Two-component mixture of normals

1. Create a do-file with the commands necessary to fit the two-component mixture of normals discussed in this chapter.

2. Add commands to the end of the do-file to calculate the standard deviations and mixing probability and the standard errors of these parameters. What are the standard errors of the estimates for the age of onset data?

3. Simulate values from two normals trying out different values for the various parameters. Do the estimated values tend to lie within two estimated standard errors from the "true" values?

13.2 Heteroscedastic linear regression

1. Use the program mixing1 to fit a linear regression model to the slimming data from Chapter 5 with status as the only explanatory variable.

2. Compare the estimated residual standard deviation with the root mean squared error obtained using the regress command.

3. Use the same program again to fit a linear regression model where the variance is allowed to differ between the groups defined by status. (Hint: Modify the (lsd: ) equation in the ml model command; see page 268.) Is there any evidence for heteroscedasticity? How do the results compare with those of sdtest?

13.3 Three-component mixture of normals

1. Extend the program mixing2 to fit a mixture of three normals and test this on simulated data (hint: use transformations p1 = 1/d, p2 = exp(lo1)/d, and p3 = exp(lo2)/d where d = 1 + exp(lo1) + exp(lo2)).

13.4 Two-component mixture of Poisson distributions

Hasselblad (1969) fitted a two-component mixture of Poisson distributions to the number of death notices of women aged 80 and over published in the New York Times between 1910 and 1912. This classic dataset is tabulated below.

    Number of notices    0    1    2    3    4    5    6    7    8    9
    Frequency          162  267  271  185  111   61   27    8    3    1

The model can be written as

$$\Pr(y;\, p_1, \mu_1, \mu_2) = p_1 \frac{\mu_1^{y} e^{-\mu_1}}{y!} + (1 - p_1) \frac{\mu_2^{y} e^{-\mu_2}}{y!}$$

where $\mu_1$ and $\mu_2$ are the means for the two components and $p_1$ is the probability of belonging to the first component. Hasselblad obtained the estimates $\hat{p}_1 = 0.3599$, $\hat{\mu}_1 = 1.2561$, and $\hat{\mu}_2 = 2.6634$. This solution was interpreted as indicating a different pattern of deaths in the winter (component 1) and summer (component 2).

1. Write a program to evaluate the log-likelihood contributions for the ml command with method lf. Note that the model is defined only for $\mu_1 > 0$ and $\mu_2 > 0$. You can use the function lnfactorial() to evaluate $\ln(y!)$.

2. Enter the data. (Hint: Use the expand command.)

3. Fit the model. Any discrepancies between your results and those given above are likely due to programming errors. You can rule out that it is due to poor starting values by using values close to the required results. Revise your program if necessary.

4. Calculate the expected frequencies for each number of notices and compare them graphically with the observed frequencies. (Hint: You can use the program that evaluates the log-likelihood contributions to calculate the log probabilities for different values of y by defining the global ML_y1 appropriately and passing the name of an existing variable and the estimated parameters to the program as arguments.)

13.5 Latent class model

Van der Heijden et al. (1992) analyze data collected by the Netherlands Ministry of Justice to investigate differences in involvement in crime among young people from four ethnic groups: Moroccans, Turks, Surinamese, and Dutch. The Dutch sample consisted of people who lived in the same streets as the people from the other ethnic groups. Three crime measures were gathered from police records: property crime, aggression against persons, and vandalism.

The variables in crime.dta are:

• vandalis: indicator for being arrested for vandalism
• aggress: indicator for aggression against a person
• property: indicator for being arrested for property crime
• ethnicity: ethnic group (1=Moroccan, 2=Turkish, 3=Surinamese, 4=Dutch)
• age: age group (1=12-13, 2=14-15, 3=16-17)

Assuming that there are subpopulations differing in their pattern of responses on the three delinquency items, we will consider a finite mixture model with two components for simplicity. What is different between this model and the other finite mixture models discussed in this chapter is that there are three responses per subject. Assuming that responses are independent within the latent classes (an assumption known as conditional or local independence), the model can be written as

$$\Pr(y_{i1}, y_{i2}, y_{i3}) = p_1 \prod_{j=1}^{3} \frac{\exp(y_{ij}\alpha_{j1})}{1 + \exp(\alpha_{j1})} + (1 - p_1) \prod_{j=1}^{3} \frac{\exp(y_{ij}\alpha_{j2})}{1 + \exp(\alpha_{j2})}$$

where $\alpha_{jc}$ is the log odds that the $j$th response $y_{ij}$ equals one for a person in latent class $c$. Here $y_{ij}$ multiplies $\alpha_{jc}$ in the numerator to produce a numerator equal to $\exp(\alpha_{jc})$ if $y_{ij} = 1$ and equal to 1 if $y_{ij} = 0$, as required.

1. Write a program to evaluate the log-likelihood contributions for ml. Note that three response variables must be specified in the ml model. You can simply use equations of the form (name1: vandalis=), (name2: aggress=), and (name3: property=) to specify the three response variables and one linear predictor for each, and then use equations without response variables for all the remaining linear predictors. The names of the response variables will be stored in the globals ML_y1, ML_y2, and ML_y3. To test the program, create a variable junk equal to 1 and specify junk as the variable name for the log-likelihood contributions, passing values of zero for all linear predictors. The result should be that junk equals ln(.5^3) = -2.0794415 for all observations.

2. Estimate the model. If non-convergence occurs, change the starting values and use the option search(norescale) in the ml maximize command. Assuming that one population will be more delinquent than the other, reasonable starting values are negative values for $\alpha_{j1}$ and positive values for $\alpha_{j2}$ (or vice versa).

3. Interpret the estimates.

4. Modify the ml model command to allow the latent class membership probability $p_1$ to depend on the covariates ethnicity and age via a logistic regression model.

5. Interpret the estimates.


Chapter 14

Principal Components Analysis: Hearing Measurement Using an Audiometer

14.1 Description of data

The data in Table 14.1 are adapted from those given in Jackson (1991), and relate to hearing measurements with an instrument called an audiometer. An individual is exposed to a signal of a given frequency with an increasing intensity until the signal is perceived. The lowest intensity at which the signal is perceived is a measure of hearing loss, calibrated in units referred to as decibel loss in comparison with a reference standard for that particular instrument. Observations are obtained one ear at a time for a number of frequencies. In this example, the frequencies used were 500 Hz, 1000 Hz, 2000 Hz, and 4000 Hz. The limits of the instrument are -10 to 99 decibels. (A negative value does not imply better than average hearing; the audiometer had a calibration "zero", and these observations are in relation to that.)

Table 14.1 Data in hear.dat (taken from Jackson (1991) with permission of his publisher, John Wiley & Sons)

    id   l500  l1000  l2000  l4000   r500  r1000  r2000  r4000
    ...
    71      0     10     40     60     -5      0     25     50
    ...

14.2 Principal component analysis

Principal component analysis is one of the oldest but still most widely used techniques of multivariate analysis. Originally introduced by Pearson (1901) and independently by Hotelling (1933), the basic idea of the method is to try to describe the variation of the variables in a set of multivariate data as parsimoniously as possible using a set of derived uncorrelated variables, each of which is a particular linear combination of those in the original data. In other words, principal component analysis is a transformation from the observed variables, $y_{1i}, \ldots, y_{pi}$, to new variables $z_{1i}, \ldots, z_{pi}$ where:

$$\begin{aligned}
z_{1i} &= a_{11} y_{1i} + a_{12} y_{2i} + \cdots + a_{1p} y_{pi} \\
z_{2i} &= a_{21} y_{1i} + a_{22} y_{2i} + \cdots + a_{2p} y_{pi} \\
&\;\;\vdots \\
z_{pi} &= a_{p1} y_{1i} + a_{p2} y_{2i} + \cdots + a_{pp} y_{pi}
\end{aligned}$$

The new variables are derived in decreasing order of importance. The coefficients $a_{11}$ to $a_{1p}$ for the first principal component are derived so that the sample variance of $z_{1i}$ is as large as possible. Since this variance could be increased indefinitely by simply increasing the coefficients, a restriction must be placed on them, generally that their sum of squares is one. The coefficients defining the second principal component $z_{2i}$ are determined to maximize its sample variance subject to the constraint that the sum of squared coefficients equals 1 and that the sample correlation between $z_{1i}$ and $z_{2i}$ is 0. The other principal components are defined similarly by requiring that they are uncorrelated with all previous principal components. It can be shown that the required coefficients are given by the eigenvectors of the sample covariance matrix of $y_{1i}, \ldots, y_{pi}$, and their variances are given by the corresponding eigenvalues. In practice components are often derived from the correlation matrix instead of the covariance matrix, particularly if the variables have very different scales. The analysis is then equivalent to calculation of the components from the original variables after these have been standardized to unit variance.

The usual objective of this type of analysis is to assess whether the first few components account for a substantial proportion of the variation in the data. If they do, they can be used to summarize the data with little loss of information. This may be useful for obtaining graphical displays of the multivariate data or for simplifying subsequent analysis. The principal components can be interpreted by inspecting the eigenvectors defining them. Here it is often useful to multiply the elements by the square root of the corresponding eigenvalue, in which case the coefficients represent correlations between an observed variable and a component. A detailed account of principal component analysis is given in Everitt and Dunn (2001).
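The link between principal components and the eigen decomposition of the correlation matrix can be seen directly with Stata's matrix commands. The following is a sketch under stated assumptions: it uses the auto dataset shipped with Stata and four of its variables as stand-ins, and the matrix names S, R, X, lambda, and Ev are chosen here for illustration.

```stata
* Sketch: principal components as an eigen decomposition of the
* correlation matrix (auto data used as a stand-in).
sysuse auto, clear
local vars price mpg weight length

* Correlation matrix from the deviation cross-product matrix
matrix accum S = `vars', noconstant deviations
matrix R = corr(S)

* Eigenvectors (columns of X) and eigenvalues (row vector lambda)
matrix symeigen X lambda = R
matrix list lambda

* pca on the same variables should report the same eigenvalues
quietly pca `vars'
matrix Ev = e(Ev)
display "largest relative difference: " mreldif(lambda, Ev)

* For a correlation matrix the eigenvalues sum to the number of variables
matrix T = lambda*J(4,1,1)
display "sum of eigenvalues: " T[1,1]
```

Because the variables are standardized, each contributes a variance of one, which is why the eigenvalues sum to the number of variables.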

14.3 Analysis using Stata

The data can be read in from an ASCII file hear.dat as follows:

    infile id l500 l1000 l2000 l4000 r500 r1000 r2000 r4000 ///
        using hear.dat, clear
    summarize

(see Display 14.1). Before undertaking a principal component analysis, some graphical exploration of the data may be useful. A scatterplot matrix, for example, with points labeled with a subject's identification number can be obtained using

    graph matrix l500-r4000, mlabel(id) msymbol(none) ///
        mlabposition(0) half


The resulting diagram is shown in Figure 14.1. The diagram looks a little "odd" due to the largely "discrete" nature of the observations. Some of the individual scatterplots suggest that some of the observations might perhaps be regarded as outliers; for example, individual 53 in the plot involving l2000, r2000. This subject has a score of 50 at this frequency in the left ear, but a score of -10 in the right ear. It might be appropriate to remove this subject's observations before further analysis, but we shall not do this and will continue to use the data from all 100 individuals.

        Variable |   Obs        Mean    Std. Dev.    Min    Max
              id |   100        50.5    29.01149       1    100
            l500 |   100        -2.8    6.408643     -10     15
           l1000 |   100         -.5    7.571211     -10     20
           l2000 |   100        2.45    11.94463     -10     50
           l4000 |   100       21.35    19.61569     -10     70
            r500 |   100        -2.6    7.123726     -10     25
           r1000 |   100         -.7    6.396811     -10     20
           r2000 |   100        1.55    9.257675     -10     35
           r4000 |   100       20.95    19.43254     -10     75

Display 14.1

As mentioned in the previous section, principal components may be extracted from either the covariance matrix or the correlation matrix of the original variables. A choice needs to be made since there is not necessarily any simple relationship between the results in each case. The summary table shows that the variances of the observations at the highest frequencies are approximately nine times those at the lower frequencies; consequently, a principal component analysis using the covariance matrix would be dominated by the 4000 Hz frequency. But this frequency is not more clinically important than the others, and so, in this case, it seems more reasonable to use the correlation matrix as the basis of the principal component analysis.

To find the correlation matrix of the data requires the following instruction:

    correlate l500-r4000

and the result is given in Display 14.2. Note that the highest correlations occur between adjacent frequencies on the same ear and between corresponding frequencies on different ears.

The pca command can be used to obtain the principal components of this correlation matrix:


1 .ow0 0.7775 I . WOO 0.3247 0.4437 1.00W 0.2554 0.2749 0.3954 I.0000 0.6963 0.6515 0.1795 0.1790 r.ooao 0.6416 0.7070 0.3532 0.2632 0.6834 1.0000 0.2399 0.2606 0.5W10 0.3183 0.1575 0.4151 1,0000 0.2264 11.2109 0.3588 O 8783 0.1421 0.2248 0.4044

r4000 - 1.0000

Display 14.2

    pca l500-r4000

which gives the results shown in Display 14.3. (The principal components of the covariance matrix can be obtained using the covariance option.)

An informal rule for choosing the number of components to represent a set of correlations is to use only those components with eigenvalues greater than one, i.e., those with variances greater than the average. Here, this leads to retaining only the first two components. Another informal indicator of the appropriate number of components is the scree plot, a plot of the eigenvalues against their rank. A scree plot may be obtained using

    screeplot

with the result shown in Figure 14.2. The number of eigenvalues above a distinct "elbow" in the scree plot is usually taken as the number of principal components to select. From Figure 14.2, this would again appear to be two. The first two components account for 68% of the variance in the data.
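The percentages quoted here follow from the eigenvalues in Display 14.3. As a worked check (the symbols $\lambda_1$ and $\lambda_2$ for the two largest eigenvalues are introduced only for this illustration): with eight standardized variables the total variance is 8, so the average eigenvalue is one and

```latex
\[
  \frac{\lambda_1}{8} = \frac{3.82375}{8} \approx 0.478,
  \qquad
  \frac{\lambda_1 + \lambda_2}{8} = \frac{3.82375 + 1.63459}{8} = \frac{5.45834}{8} \approx 0.682 .
\]
```

Only these two eigenvalues exceed one, which is why the eigenvalue-greater-than-one rule retains two components.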

Examining the eigenvectors defining the first two principal components, we see that the first, accounting for 48% of the variance, has coefficients that are all positive and all approximately the same size. This principal component essentially represents the overall hearing loss of a subject and implies that individuals suffering hearing loss at certain frequencies will be more likely to suffer this loss at other frequencies as well. The second component, accounting for 20% of the variance, contrasts high frequencies (2000 Hz and 4000 Hz) and low frequencies (500 Hz and 1000 Hz). It is well known in the case of normal hearing that hearing loss as a function of age is first noticeable in the higher frequencies.

    Principal components/correlation          Number of obs    =       100
                                              Number of comp.  =         8
                                              Trace            =         8
        Rotation: (unrotated = principal)     Rho              =    1.0000

       Component |  Eigenvalue   Difference   Proportion   Cumulative
           Comp1 |     3.82375      2.18915       0.4780       0.4780
           Comp2 |     1.63459      .725552       0.2043       0.6823
           Comp3 |     .909042      .409528       0.1136       0.7959
           Comp4 |     .499514      .122075       0.0624       0.8584
           Comp5 |     .377439     .0383411       0.0472       0.9055
           Comp6 |     .339098     .0780871       0.0424       0.9479
           Comp7 |     .261011      .105451       0.0326       0.9806
           Comp8 |     .155561            .       0.0194       1.0000

Principal cmpnnnts (eignnvecrors)

Variable Camp7 cornpa Unmxplamsd

1500 0.2828 -0.60M 0 11000 -0.0292 0.6133 0 12000 13.2793 -0.0640 0 14000 0.9354 -0,0298 0 rSOO 0.1275 0.3860 0

r1000 -0.4618 -0.3P28 0 r2000 0.4476 0.0293 0 r4ODO -0.4709 0.0747 , 0

Display 14.3

Figure 14.2: Scree plot.

Scores for each individual on the first two principal components might be used as a convenient way of summarizing the original eight-dimensional data. Such scores are obtained by applying the elements of the corresponding eigenvector to the standardized values of the original observations for an individual. The necessary calculations can be carried out using the predict command with the score option:

    predict pc1 pc2, score

(see Display 14.4). The new variables pc1 and pc2 contain the scores for the first two principal components, and the output lists the coefficients used to form these scores. For principal component analysis, these coefficients are just the elements of the eigenvectors in Display 14.3. The principal component scores can be used to produce a useful graphical display of the data in a single scatterplot, which may then be used to search for structure or patterns in the data, particularly the presence of clusters of observations (see Everitt et al., 2001). Such a principal component plot is obtained using

    twoway scatter pc2 pc1, mlabel(id)

(Note that the scoreplot command can be used to produce the same graph without first storing scores as new variables in the dataset.) The resulting diagram is shown in Figure 14.3. Here, the variability in differential hearing loss for high versus low frequencies (pc2) is greater among subjects with higher overall hearing loss, as would be expected.

    (6 components skipped)

    Scoring coefficients
        sum of squares(column-loading) = 1

        Variable |    Comp1     Comp2     Comp3     Comp4     Comp5     Comp6
            l500 |   0.4091   -0.3135    0.1359   -0.2722   -0.1665    0.4168
           l1000 |   0.4242   -0.2301   -0.0933   -0.5528   -0.4998   -0.0947
           l2000 |   0.3271    0.3007   -0.4777   -0.4872    0.6033    0.0404
           l4000 |   0.2850    0.4488    0.4711   -0.1796    0.0990   -0.5129
            r500 |   0.3511   -0.3874    0.2394    0.3045    0.8283    0.1776
           r1000 |   0.4160   -0.2867   -0.0568    0.3645   -0.0881   -0.5446
           r2000 |   0.3090    0.3228   -0.6384    0.5169   -0.1623    0.1255
           r4000 |   0.2696    0.4972    0.4150    0.1976   -0.1757    0.4569

        Variable |    Comp7     Comp8
           l1000 |  -0.0293    0.6133
           l2000 |  -0.2793   -0.0640
           l4000 |   0.4354   -0.0298
            r500 |   0.1275    0.3660

Display 14.4

Note that the distances between observations in this graph approximate the Euclidean distances between the (standardized) variables, i.e., the graph is a multidimensional scaling solution. In fact, the graph is the classical scaling (or principal coordinate) scaling solution to the Euclidean distances (see Everitt and Dunn, 2001, or Everitt and Rabe-Hesketh, 1997). If other variables such as age were available, it would be interesting to investigate their relationship with the principal components (see Exercise 14.2).
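The description of how scores are formed, eigenvector elements applied to standardized observations, can be verified by rebuilding the first score by hand. This is a sketch only, again assuming the auto dataset shipped with Stata as a stand-in; the variable name pc1_hand and the matrix name L are made up for the illustration.

```stata
* Sketch: a component score as an eigenvector-weighted sum of
* standardized variables (auto data used as a stand-in).
sysuse auto, clear
local vars price mpg weight length
quietly pca `vars'
quietly predict pc1, score

* e(L) holds the eigenvectors, one column per component
matrix L = e(L)
generate double pc1_hand = 0
foreach v of local vars {
    quietly summarize `v'
    quietly replace pc1_hand = pc1_hand + ///
        L[rownumb(L, "`v'"), 1]*(`v' - r(mean))/r(sd)
}
summarize pc1 pc1_hand
```

The two variables should agree up to the precision with which predict stores the score.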

Figure 14.3: Principal component plot.

14.4 Exercises

14.1 Hearing measurement using an audiometer

1. Rerun the principal component analysis described in this chapter using the covariance matrix of the observations. Compare the results with those based on the correlation matrix.

2. Interpret components 3 through 8 in the principal components analysis based on the correlation matrix.

3. Create a scatterplot matrix of the first five principal component scores.

14.2 Determinants of pollution in U.S. cities

1. Apply principal component analysis to the air pollution data analyzed in Chapter 3, excluding the variable so2, and plot the first two principal components (i.e., the two-dimensional classical scaling solution for Euclidean distances between standardized variables).

2. Regress so2 on the first two principal components and add a line corresponding to this regression (the direction of steepest increase in so2 predicted by the regression plane) into the multidimensional scaling solution. If the principal components are denoted pc1 and pc2 and the corresponding estimated regression coefficients $\hat{\beta}_1$ and $\hat{\beta}_2$, add the line $(\hat{\beta}_2/\hat{\beta}_1)\,\mathrm{pc1}$ versus pc1 to the scatterplot of pc2 versus pc1.

14.3 Characteristics of criminals

The correlation matrix below was calculated from measurements of seven physical characteristics in each of 3000 convicted criminals (MacDonnell, 1902).

The characteristics were (in the same order for the correlation matrix) (1) Head length; (2) Head breadth; (3) Face breadth; (4) Left finger length; (5) Left forearm length; (6) Left foot length; (7) Height. The correlation matrix is contained in the ASCII file criminals.txt.

1. Find the principal components (Hint: convert the data to a matrix using mkmat and then use the pcamat command with the names() option).

2. Plot a scree plot and discuss how many principal components appear to be required.

3. Interpret the principal components.

14.4 Nutritional contents of food

Hartigan (1975) provides data on the nutritional content of different foodstuffs (the quantity involved is always three ounces).

The variables in nutrition.dta are:

• food: type of food (string variable)
• energy: energy content (calories) in 3 ounces of the foodstuff
• protein: protein (grams) in 3 ounces of the foodstuff
• fat: fat (grams) in 3 ounces of the foodstuff
• calcium: calcium (milligrams) in 3 ounces of the foodstuff
• iron: iron (milligrams) in 3 ounces of the foodstuff

1. Create a scatterplot matrix of the data labeling the foodstuffs appropriately in each panel. Use only the first two characters of the strings in food as labels.

2. On the basis of this diagram undertake what you think is an appropriate principal components analysis.

3. Produce a principal component plot with two-character labels for the foodstuffs.

4. Try to interpret the first two principal components.


Chapter 15

Cluster Analysis: Tibetan Skulls and Determinants of Pollution in U.S. Cities

15.1 Description of data

The first set of data to be used in this chapter is shown in Table 15.1. These data, collected by Colonel L.A. Waddell, were first reported in Morant (1923) and are also given in Hand et al. (1994). The data consist of five measurements on each of 32 skulls found in the southwestern and eastern districts of Tibet. The five measurements (all in millimeters) are as follows:

• y1: greatest length of skull
• y2: greatest horizontal breadth of skull
• y3: height of skull
• y4: upper face length
• y5: face breadth, between outermost points of cheek bones

The main question of interest about these data is whether there is any evidence of different types or classes of skull.

The second set of data that we shall analyze in this chapter is the air pollution data introduced previously in Chapter 3 (see Table 3.1). Here we shall investigate whether the clusters of cities found are predictive of air pollution.

Table 15.1 Tibetan skull data

15.2 Cluster analysis Clrlster walj'sis is tr generic t.erm for a set, of (Jwgely) explorttt.nrp data andysis techaiquex that se~k to ii~imrer groiqas or clusters in dets. The lerm exploratory is import,ant since it cxplttins the largely ab- sent. "pvdue", ubrquitans in many ot,her arew of statistics. CIusteririg methods are primarily intendm1 for gene~atirig rather than tcsting hy- yothescs. A detdhd accnr~nt of what. 1s now a very large m a is g i v ~ r ~ in Everitt e t QI. (2001). The most comlnonly nsd class of dusrtcritrg methods contair~s tho.-

ir~ct~horls t.hmt lcad ta a scrim of n w t d or hiwarchical cl~lssifications of ~ht: observations, beginning at the stage :c:r!iere each observation is rc- gardcd forrnirlg a sir~glemember "t:lustitcr" and cnding at the stage where all the obmwtmils are in s single cluster. The corr~plete bi- erasthy of snlut,ions cxrl bc disp la~d as a tree rlingram lmo\wvrl as a dcndrogram In practice, rriost uscrs will bc i r i t~mtcd not jn the whole dendrogrm~, but in wlclecli11g a pxrticukr number of clusters that i s op- ~irnal in .some sellfie for the data. This cntaiis "cutting" the de~~drogrsm at some particuhr Icwl.

Most hierarchical methods operate not on the raw data, but on an inter-individual distance matrix calculated from the raw data. The most commonly used distance measure is Euclidean and is defined as:

$$d_{ij} = \sqrt{\sum_{k=1}^{p} (y_{ik} - y_{jk})^2}$$

where $y_{i1}$ to $y_{ip}$ are the variables for individual $i$.

A variety of hierarchical clustering techniques arise because of the different ways in which the distance between a cluster containing several observations and a single observation, or between two clusters, can be defined. The inter-cluster distances used by three commonly applied hierarchical clustering techniques are:

Single linkage clustering: distance between the closest pair of observations where one member of the pair is in the first cluster and the other in the second cluster.

Complete linkage clustering: distance between the most remote pair of observations where one member of the pair is in the first cluster and the other in the second cluster.

Average linkage: average of distances between all pairs of observations where one member of the pair is in the first cluster and the other in the second cluster.
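The three inter-cluster distances listed above can be illustrated with a short sketch. The Python code below is not how Stata computes them; the function names and the two toy clusters are assumptions made up purely for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two observations (equal-length sequences)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(cluster_a, cluster_b):
    """All distances for pairs with one member in each cluster."""
    return [euclidean(a, b) for a in cluster_a for b in cluster_b]

def single_linkage(cluster_a, cluster_b):
    # Distance between the closest pair.
    return min(pairwise(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    # Distance between the most remote pair.
    return max(pairwise(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    # Average over all between-cluster pairs.
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)

# Toy data (hypothetical): two clusters, two variables per observation.
first = [(0.0, 0.0), (0.0, 3.0)]
second = [(4.0, 0.0), (4.0, 3.0)]

print(single_linkage(first, second),
      complete_linkage(first, second),
      average_linkage(first, second))   # 4.0 5.0 4.5
```

The same pair of clusters is 4, 5, or 4.5 units apart depending on the definition, which is why the three methods can produce different hierarchies from the same distance matrix.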

An alternative to the hierarchical methods described above is k-means clustering. Here the data are partitioned into a specified number of groups set by the user by an iterative process in which, starting from an initial set of cluster means, each observation is placed into the group to whose mean vector it is closest (generally in the Euclidean sense). After each iteration, new group means are calculated and the procedure repeated until no observations change groups. The initial group means can be chosen in a variety of ways. In general, the method is applied to the data for different numbers of groups and then an attempt is made to select the number of groups that provides the best fit for the data.
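The iterative process just described can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not Stata's cluster kmeans algorithm: the function name kmeans, the toy data, and the choice of initial means are all hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(data, initial_means, max_iter=100):
    """Plain k-means; returns (group label per observation, final group means)."""
    means = [list(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the group with the closest mean.
        new_labels = [
            min(range(len(means)), key=lambda g: euclidean(obs, means[g]))
            for obs in data
        ]
        if new_labels == labels:
            break  # no observation changed groups
        labels = new_labels
        # Update step: recompute group means (an empty group keeps its old mean).
        for g in range(len(means)):
            members = [obs for obs, lab in zip(data, labels) if lab == g]
            if members:
                means[g] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means

# Toy data (hypothetical): two well-separated groups of three observations.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# Deliberately poor starting means, both taken from the first group.
labels, means = kmeans(data, initial_means=[(0.0, 0.0), (1.0, 0.0)])
print(labels)   # [0, 0, 0, 1, 1, 1]
```

Even with both starting means inside the same group, the reassignment and update steps recover the two groups here; with less clearly separated data the result can depend on the starting values.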

Important issues that need to be considered when using clustering in practice include how to scale the variables before calculating the chosen distance matrix, which particular method of cluster analysis to use, and how to decide on the appropriate number of groups in the data. These and many other practical problems of clustering are discussed in Everitt et al. (2001).
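The scaling issue can be made concrete with a small sketch. The Python code below standardizes each variable to mean 0 and standard deviation 1, assuming the usual sample standard deviation (divisor n - 1); the function name standardize and the two toy variables are made up for illustration.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample SD, divisor n - 1
    return [(v - m) / s for v in values]

# Toy variables (hypothetical) measured on very different scales.
temps = [50.0, 60.0, 70.0]
pops = [100.0, 500.0, 900.0]

print(standardize(temps))   # [-1.0, 0.0, 1.0]
print(standardize(pops))    # [-1.0, 0.0, 1.0]
```

On the raw scale the second variable would dominate any Euclidean distance; after standardizing, both contribute equally.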

15.3 Analysis using Stata

Assuming the data in Table 15.1 are contained in a file tibetan.dat, they can be read into Stata using the instruction

infile y1 y2 y3 y4 y5 using tibetan.dat, clear
generate id = _n

Here we have also generated an identifier variable id for the skulls. To begin it is good practice to examine some graphical displays of the data. With multivariate data such as the measurements on skulls in Table 15.1 a scatterplot matrix is often helpful and can be generated as follows:

graph matrix y1-y5

The resulting plot is shown in Figure 15.1. A few of the individual scatterplots in Figure 15.1 are perhaps suggestive of a division of the observations into distinct groups, for example that for y4 (upper face height) versus y5 (face breadth).

Figure 15.1: Scatterplot matrix of Tibetan skull data.

We shall now apply each of single linkage, complete linkage, and average linkage clustering to the data using Euclidean distance as the basis of each analysis. Here the five measurements are all on the same scale, so that standardization before calculating the distance matrix is probably not needed (but see the analysis of the air pollution data described later). The necessary Stata commands are

cluster singlelinkage y1-y5, name(sl)
cluster dendrogram
cluster completelinkage y1-y5, name(cl)
cluster dendrogram
cluster averagelinkage y1-y5, name(al)
cluster dendrogram

Here the name() option is used to attach a name to the results from each cluster analysis. The resulting three dendrograms are shown in Figures 15.2, 15.3, and 15.4.

The single linkage dendrogram illustrates one of the common problems with this technique, namely its tendency to incorporate observations into existing clusters rather than begin new ones, a property generally referred to as chaining (see Everitt et al., 2001, for full details). The complete linkage and average linkage dendrograms show more evidence of cluster structure in the data, although this structure appears to be different for each method, a point we shall investigate later.
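Chaining is easy to reproduce on made-up data. The sketch below is a naive agglomerative procedure in Python, written only to illustrate the idea; it is not Stata's algorithm, and the function name agglomerate and the one-dimensional toy values are assumptions.

```python
def agglomerate(points, linkage, n_groups):
    """Naive agglomerative clustering of 1-D points.

    linkage is the built-in min (single linkage) or max (complete linkage),
    applied to all between-cluster distances. Merging stops at n_groups clusters.
    """
    clusters = [[p] for p in points]
    while len(clusters) > n_groups:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

# Toy values (hypothetical): gaps grow slowly, then one distant point.
points = [0, 10, 21, 33, 46, 60, 100]

print(agglomerate(points, min, 2))   # [[0, 10, 21, 33, 46, 60], [100]]
print(agglomerate(points, max, 2))   # [[0, 10, 21, 33], [46, 60, 100]]
```

Single linkage keeps adding the nearest observation to the existing cluster and ends with one large group and a singleton, whereas complete linkage produces two groups of more comparable size.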

Figure 15.2: Dendrogram using single linkage.

Figure 15.3: Dendrogram using complete linkage.

Figure 15.4: Dendrogram using average linkage.

In most applications of cluster analysis the researcher will try to determine the solution with the optimal number of groups, i.e., the number of groups that "best" fits the data. Estimating the number of groups in a cluster analysis is a difficult problem without a completely satisfactory solution; see Everitt et al. (2001). Two stopping rules are provided in Stata, the Caliński and Harabasz pseudo F-statistic (Caliński and Harabasz, 1974) and the Duda and Hart index (Duda and Hart, 1973); see [MV] cluster stop for details. For both these rules, larger values indicate more distinct clustering.

Here we shall illustrate the use of the Duda and Hart index in association with the three clustering techniques applied above. The Stata commands are

cluster stop sl, rule(duda) groups(1/5)

(see Display 15.1),

cluster stop cl, rule(duda) groups(1/5)

(see Display 15.2), and

cluster stop al, rule(duda) groups(1/5)

(see Display 15.3). Distinct clustering is generally considered to be indicated by large values of the Duda and Hart index and small values of the Duda and Hart pseudo T-squared. Adopting this approach, the results from single linkage clustering do not suggest any distinct cluster structure, largely because of the chaining phenomenon. The results associated with complete linkage clustering suggest a five-group solution and those from the average linkage method suggest perhaps three or four groups.
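To see why a large index and a small pseudo T-squared go together, it may help to compute both for a single proposed split. The Python sketch below assumes the usual definitions: Je(1) is the within-group sum of squares of the group to be divided, Je(2) is the sum of the within-group sums of squares of the two resulting subgroups, the index is Je(2)/Je(1), and the pseudo T-squared is (Je(1) - Je(2)) / (Je(2)/(N1 + N2 - 2)). The function names and toy data are hypothetical, and the rule Stata uses to decide which group is split at each level is not reproduced.

```python
def sse(group):
    """Sum of squared Euclidean distances from each observation to the group mean."""
    n = len(group)
    centre = [sum(col) / n for col in zip(*group)]
    return sum((x - c) ** 2 for obs in group for x, c in zip(obs, centre))

def duda_hart(sub1, sub2):
    """Return (Je(2)/Je(1), pseudo T-squared) for splitting sub1 + sub2 in two."""
    je1 = sse(sub1 + sub2)            # before the split
    je2 = sse(sub1) + sse(sub2)       # after the split
    index = je2 / je1
    pseudo_t2 = (je1 - je2) / (je2 / (len(sub1) + len(sub2) - 2))
    return index, pseudo_t2

# Toy one-variable data (hypothetical).
worthwhile = duda_hart([(0.0,), (2.0,)], [(10.0,), (12.0,)])
pointless = duda_hart([(0.0,), (2.0,)], [(1.0,), (3.0,)])
print(worthwhile)   # small index, large pseudo T-squared
print(pointless)    # index near 1, small pseudo T-squared
```

An index near 1 with a small pseudo T-squared says that dividing the group further gains little, which supports stopping at the current number of groups.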


Display 15.1

Display 15.3


To see which skulls are placed in which groups we can use the cluster generate command. For example, to examine the five group solution given by complete linkage we use

cluster generate g5cl = groups(5), name(cl)
sort g5cl id
forvalues i=1/5 {
    display " "
    display "cluster `i'"
    list id if g5cl==`i', noobs noheader separator(0)
}

Here we use a forvalues loop to list id for each cluster. The noobs option suppresses line numbers; the noheader option suppresses variable names; and the separator(0) option suppresses separator lines. The resulting output is shown in Display 15.4. The numbers of observations in each group can be tabulated using

tabulate g5cl

giving the table in Display 15.5. It is often helpful to compare the mean vectors of each of the clusters. The necessary code to find these is:

table g5cl, contents(mean y1 mean y2 mean y3 mean y4 ///
    mean y5) format(%4.1f)

The skulls in cluster 1 are characterized by being relatively long and narrow. Those in cluster 2 are, on average, shorter and broader. Cluster 3 skulls appear to be particularly narrow, and those in cluster 4 have short upper face length. Skulls in cluster 5 might perhaps be considered "average".

The scatterplot matrix of the data used earlier to allow a "look" at the raw data is also useful for examining the results of clustering the data. For example, we can produce a scatterplot matrix with observations identified by their cluster number from the three group solution from average linkage:

cluster generate g3al = groups(3), name(al)
graph matrix y1-y5, mlabel(g3al) mlabpos(0) msymbol(i)

(see Figure 15.5). The separation between the three groups is most distinct in the panel for greatest length of skull y1 versus face breadth y5.

Display 15.4

       g5cl |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          5       15.63       15.63
          2 |          8       25.00       40.63
          3 |          4       12.50       53.13
          4 |          2        6.25       59.38
          5 |         13       40.63      100.00
------------+-----------------------------------
      Total |         32      100.00

Display 15.5

Figure 15.5: Scatterplot matrix with observations identified by their cluster number.

The data as originally collected by Colonel Waddell were thought to consist of two types of skulls: the first type, skulls 1-17, came from graves in Sikkim and the neighboring areas of Tibet. The remaining 15 skulls were picked up on a battlefield in the Lhasa district and were believed to be those of native soldiers from the eastern province of Khams. These skulls were of particular interest because it was thought at the time that Tibetans from Khams might be survivors of a particular fundamental human type, unrelated to the Mongolian and Indian types which surrounded them. We can compare this classification with the two group solutions given by each of the three clustering methods by cross-tabulating the corresponding categorical variables containing group membership. The Stata code for this is:

generate cl2 = cond(id<=17,1,2)
cluster generate g2sl = groups(2), name(sl)
cluster generate g2cl = groups(2), name(cl)
cluster generate g2al = groups(2), name(al)
tabulate cl2 g2sl, row

(see Display 15.7),

Key: frequency, row percentage

           |        g2sl
       cl2 |       1        2 |   Total
-----------+------------------+--------
         1 |       0       17 |      17
           |    0.00   100.00 |  100.00
         2 |       1       14 |      15
           |    6.67    93.33 |  100.00
-----------+------------------+--------
     Total |       1       31 |      32
           |    3.13    96.88 |  100.00

tabulate cl2 g2cl, row

(see Display 15.8), and

tabulate cl2 g2al, row

(see Display 15.9). The two group solution from single linkage consists of 31 observations in one group and only a single observation in the second group, again illustrating the chaining problem associated with this method. The complete linkage solution provides the closest match to the division originally proposed for the skulls (with group labels interchanged).

           |        g2cl
       cl2 |       1        2 |   Total
-----------+------------------+--------
         1 |       3       14 |      17
           |   17.65    82.35 |  100.00
         2 |      10        5 |      15
           |   66.67    33.33 |  100.00
-----------+------------------+--------
     Total |      13       19 |      32
           |   40.63    59.38 |  100.00

Display 15.8

15.3.2 Determinants of pollution in U.S. cities

In this section we shall apply k-means clustering to the air pollution data from Chapter 3. We will use the variables temp, manuf, pop, wind, precip, and days for the cluster analysis. Since these variables have very different metrics we shall begin by standardizing them

infile str10 town so2 temp manuf pop wind precip days ///
    using usair.dat, clear
foreach var of varlist temp manuf pop wind precip days {
    egen s`var' = std(`var')
}

We will now use the k-means algorithm to divide the data into 2, 3, 4, and 5 groups using the default for choosing initial cluster centers, namely the random selection of k unique observations from among those to be clustered. To be able to reproduce the results here, we set the random number seed using the option start(krandom(234)). The commands are

cluster kmeans stemp smanuf spop swind sprecip sdays, ///
    k(2) start(krandom(234)) name(cluster2)

cluster kmeans stemp smanuf spop swind sprecip sdays, ///
    k(3) start(krandom(234)) name(cluster3)

cluster kmeans stemp smanuf spop swind sprecip sdays, ///
    k(4) start(krandom(234)) name(cluster4)

cluster kmeans stemp smanuf spop swind sprecip sdays, ///
    k(5) start(krandom(234)) name(cluster5)

We can now use the Caliński and Harabasz approach to selecting the optimal number of groups:

cluster stop cluster2

cluster stop cluster3

cluster stop cluster4

cluster stop cluster5
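The Caliński and Harabasz pseudo-F compares the between-group and within-group sums of squares of a partition, each divided by its degrees of freedom: [B/(g - 1)] / [W/(n - g)] for g groups and n observations. The Python sketch below illustrates that calculation on made-up data; the function name pseudo_f and the toy partitions are assumptions, and this is not Stata's code.

```python
def pseudo_f(groups):
    """Calinski-Harabasz pseudo-F for a partition given as a list of groups."""
    all_obs = [obs for grp in groups for obs in grp]
    n, g = len(all_obs), len(groups)
    grand = [sum(col) / n for col in zip(*all_obs)]
    within = 0.0
    between = 0.0
    for grp in groups:
        centre = [sum(col) / len(grp) for col in zip(*grp)]
        # Squared distances of observations from their own group mean.
        within += sum((x - c) ** 2 for obs in grp for x, c in zip(obs, centre))
        # Squared distance of the group mean from the overall mean, weighted by size.
        between += len(grp) * sum((c - m) ** 2 for c, m in zip(centre, grand))
    return (between / (g - 1)) / (within / (n - g))

# Toy one-variable partitions (hypothetical).
distinct = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
overlapping = [[(0.0,), (2.0,)], [(1.0,), (3.0,)]]
print(pseudo_f(distinct), pseudo_f(overlapping))   # 50.0 0.5
```

The partition with well-separated groups has the much larger pseudo-F, which is why the number of groups with the largest value is preferred.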

The largest value of the Caliński and Harabasz index corresponds to the four group solution. Details of this solution can be found from

sort cluster4 town
forvalues i=1/4 {
    display " "
    display "cluster `i'"
    list town if cluster4==`i', noobs noheader separator(0)
}

The output is shown in Display 15.10. Note that Chicago is in a cluster of its own, indicating again (as in Chapter 3) that this appears to be an outlier. It is worthwhile repeating the cluster analysis with Chicago removed to see how much the composition of the remaining clusters changes (see Exercise 15.2), but here we will continue interpreting the current solution. We can use the tabstat command to tabulate the cluster means (the table command can only tabulate up to five statistics):

tabstat temp manuf pop wind precip days, by(cluster4) ///
    nototal format(%4.1f)

(see Display 15.11). We will now compare pollution levels (the annual mean concentration of sulphur dioxide so2) between these four clusters of towns. The means and standard deviations can be tabulated using

table cluster4, contents(mean so2 sd so2) format(%4.1f)

(see Display 15.12). Cluster 2, Chicago, has extremely high pollution levels and cluster 1 has much higher pollution levels than clusters 3 and 4.

Display 15.10

Summary statistics: mean
  by categories of: cluster4

cluster4 |    temp    manuf      pop     wind   precip     days
---------+-----------------------------------------------------
       1 |    61.5    475.7    585.4      9.7     36.7    127.0
       2 |    50.6   3344.0   3369.0     10.4     34.4    122.0
       3 |    58.5    295.6    479.1      9.3     18.6     70.9
       4 |    64.2    263.2    476.6      9.0     49.8    113.0

Display 15.11

Display 15.12

A more formal analysis of differences in pollution levels among the clusters can be undertaken using a one-way analysis of variance. (Note that the variable so2 did not contribute to the cluster analysis. If it had, it would be circular and invalid to carry out the analysis of variance.) We will first log-transform so2 since the standard deviations appear to increase with the mean.

generate lso2 = ln(so2)
anova so2 cluster4

(see Display 15.13). The analysis shows that the clusters differ significantly in their average pollution levels, F(3,37) = 13.21, p < 0.001.
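The reasoning behind the log transformation and the F-test can be sketched on made-up numbers. In the Python code below, the function name oneway_f and the toy values are assumptions; the toy values are not the air pollution data and were chosen so that the spread grows with the mean on the raw scale.

```python
import math
import statistics

def oneway_f(groups):
    """One-way analysis of variance: returns (F, df between, df within)."""
    values = [v for grp in groups for v in grp]
    n, g = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(grp) * (sum(grp) / len(grp) - grand) ** 2 for grp in groups)
    ss_within = sum((v - sum(grp) / len(grp)) ** 2 for grp in groups for v in grp)
    df1, df2 = g - 1, n - g
    return (ss_between / df1) / (ss_within / df2), df1, df2

# Toy pollution-like values (hypothetical): the second group is 8 times the first.
raw = [[10.0, 20.0, 40.0], [80.0, 160.0, 320.0]]
logged = [[math.log(v) for v in grp] for grp in raw]

raw_sds = [statistics.stdev(grp) for grp in raw]        # very unequal
log_sds = [statistics.stdev(grp) for grp in logged]     # equal after the transform
f_stat, df1, df2 = oneway_f(logged)
print(raw_sds, log_sds, f_stat, df1, df2)
```

On the raw scale the second group's standard deviation is eight times the first; after taking logs the two are equal, which is the situation the equal-variance assumption of the analysis of variance requires.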

15.4 Exercises

15.1 Tibetan skulls

1. Repeat the analysis of the Tibetan skull data described in this chapter using the Manhattan distance measure rather than the Euclidean. Compare the two sets of results.

Number of obs =      41     R-squared     =  0.5173
Root MSE      = 16.9578     Adj R-squared =  0.4781

Source | Partial SS    df    MS    F    Prob > F

Display 15.13

15.2 Determinants of pollution in U.S. cities

1. Repeat the k-means cluster analysis of the air pollution data with Chicago removed. Does this lead to a very different solution?

2. Investigate the use of other options for determining an initial partition when applying k-means to the air pollution data (still with Chicago removed).

3. Compare the results from k-medians cluster analysis applied to the air pollution data with those from k-means (still with Chicago removed).

15.3 Romano-British pottery

Tubb et al. (1980) provide the chemical composition of 48 specimens of Romano-British pottery, determined by atomic absorption spectrophotometry. In addition to the chemical composition of the pots, the kiln site at which the pottery was found is also noted.

The variables in pottery.dta are:

no: identification number of pottery

kiln: identification number of kiln (site) where pottery was found

al2o3, fe2o3, mgo, na2o, k2o, tio2, mno, bao: amount of Al2O3, Fe2O3, MgO, CaO, Na2O, K2O, TiO2, MnO, and BaO, respectively

1. Apply k-means clustering to the chemical data remembering that the variables are on very different scales.

2. Use the Caliński and Harabasz approach for finding the best number of clusters (try between two and six clusters).

3. Once you have chosen a particular solution, assess whether there is any association between the kiln site and the distinct compositional groups found by cluster analysis.

15.4 Life expectancies

Keyfitz and Flieger (1971) provide data on life expectancy (in the 1960s) in years by country, age, and sex. Here, life expectancy refers to the mean number of years of life for people of a given sex who have reached a given age.

The variables in life.dta are:

country: country (string variable)

short: abbreviation of country

m0 to m75: men, ages 0, 25, 50, and 75

w0 to w75: women, ages 0, 25, 50, and 75

1. Apply complete linkage, average linkage, and single linkage cluster analysis and generate variables of group membership for the four-cluster solution.

2. Perform principal components analysis based on the covariance matrix of the life expectancy variables. Interpret the first three components.

3. Plot the four-group solution using complete linkage in the space of the first two principal components using different marker symbols for the four groups and the strings in the variable short as labels. Produce analogous graphs for average linkage and single linkage.

Appendix: Answers to Selected Exercises

Chapter 1

1.1 Some data manipulation

2. Assuming that the data are stored in the directory c:\user,

   cd c:\user
   insheet using test.dat, clear

4. label define s 1 male 2 female
   label values sex s

5. generate id = _n

6. rename v1 time1
   rename v2 time2
   rename v3 time3

   or

   forvalues i = 1/3 {
       rename v`i' time`i'
   }

7. reshape long time, i(id) j(occ)

8. egen d = mean(time), by(id)
   replace d = (time-d)^2

9. drop if occ==3&id==2

Chapter 2

2.1 Female psychiatric patients

1. insheet using fem.dat, clear
   table depress, contents(mean weight)

2. foreach var in iq age weight {
       table life, contents(mean `var' sd `var')
   }

3. graph bar (count) id, ///
       over(sex, relabel(1 "no" 2 "yes")) ///
       over(life, relabel(1 "non-suicidal" 2 "suicidal")) ///
       ytitle(Percentages by group (suicidal versus not)) ///
       asyvars percent showyvars legend(off)

4. search mann
   help ranksum

5. ranksum weight, by(life)

6. twoway (scatter iq age if life==1, msymbol(circle) ///
       mcolor(black) jitter(2)) ///
       (scatter iq age if life==2, msymbol(x) ///
       mcolor(black) jitter(2)), ///
       legend(order(1 "no" 2 "yes"))

7. spearman age iq

8. Save the commands in the Review window and edit the file using the Do-file Editor or any text editor, e.g., Notepad. Add the commands given in the do-file template in Section 1.11, and save the file with the extension .do. Run the file by typing the command do filename.

Chapter 4

4.1 Treating hypertension

1. infile bp11 bp12 bp13 bp01 bp02 bp03 using bp.raw, clear

   Now follow the commands on pages 88 to 89. There is no need to redefine labels, but if you wish to do so, first issue the command label drop _all.

2. graph box bp, over(drug)
   graph box bp, over(diet)
   graph box bp, over(biofeed)

4. sort id
   save bp
   infile id age using age.dat, clear
   sort id
   merge id using bp
   anova bp drug diet biofeed age, continuous(age)

Chapter 5

5.1 Effectiveness of slimming clinics

1. infile manual exper resp using slim.dat, clear
   anova resp manual*exper exper manual, sequential

2. generate dmanual = manual - 1
   generate dexper = exper - 1
   generate dinter = dmanual*dexper
   regress resp dmanual dexper dinter

3. xi: regress resp i.manual*i.exper

4. char manual[omit] 2
   char exper[omit] 2
   xi: regress resp i.manual*i.exper

Chapter 6

6.1 Treatment of lung cancer

1. infile fr1 fr2 fr3 fr4 using tumor.dat, clear
   generate therapy = int((_n-1)/2)
   sort therapy
   by therapy: generate sex = _n
   label define t 0 seq 1 alt, modify
   label values therapy t
   label define s 1 male 2 female, modify
   label values sex s
   reshape long fr, i(therapy sex) j(outc)
   ologit outc therapy sex [fweight=fr], table

6.2 Female psychiatric patients

1. a. insheet using fem.dat, clear
      replace sleep=. if sleep==3
      recode sleep 1=2 2=1
      ologit depress life

   b. replace life = life - 1
      logistic life depress

2. Even if we use very lenient inclusion and exclusion criteria,

   stepwise, pr(0.3) pe(0.2) forward: ///
       logit life depress anxiety iq sex sleep

   only depress is selected. If we exclude depress from the list of candidate variables, anxiety and sleep are selected.

6.3 Diagnosis of heart attacks

1. infile ck pres abs using sck.dat, clear
   generate tot = pres + abs
   expand tot
   by ck, sort: generate infct = (_n<=pres)
   logit infct ck
   estat gof, table

2. generate ck2 = ck^2
   logit infct ck ck2
   generate ck3 = ck^3
   logit infct ck ck2 ck3
   generate ck4 = ck^4
   logit infct ck ck2 ck3 ck4
   logit infct ck ck2 ck3

3. estat gof, table

4. infile ck pres abs using sck.dat, clear
   generate tot = pres + abs
   generate prop = pres/tot

   twoway (function y = invlogit(_b[_cons]+_b[ck]*x ///
       +_b[ck2]*x^2+_b[ck3]*x^3), range(0 480)) ///
       (scatter prop ck), xtitle(CK) ///
       legend(order(1 "predicted" 2 "observed"))

Chapter 7

7.1 Effectiveness of slimming clinics

1. infile cond status resp using slim.dat, clear
   xi: glm resp i.cond i.status, fam(gauss) link(id)
   local dev1 = e(deviance)
   xi: glm resp i.cond, fam(gauss) link(id)
   local dev0 = e(deviance)
   local ddev = `dev0'-`dev1'
   /* F-test equivalent to anova cond status, sequential */
   local f = (`ddev'/1)/(`dev1'/31)
   display `f'
   display fprob(1,31,`f')
   /* difference in deviance */
   display `ddev'
   display chiprob(1, `ddev')

2. regress resp status, vce(robust)
   ttest resp, by(status) unequal

7.2 Australian school children

1. use quine, clear
   encode eth, gen(ethnic)
   drop eth
   encode sex, gen(gender)
   drop sex
   encode age, gen(class)
   drop age
   encode lrn, gen(slow)
   drop lrn
   generate cleth = class*ethnic
   glm days class ethnic cleth, family(poisson) link(log)

2. glm days class ethnic if stres<4, family(poisson) link(log)

   or, assuming the sort order of the data has not changed,

   glm days class ethnic if _n!=72, family(poisson) link(log)

3. generate abs = cond(days>=14,1,0)
   glm abs class ethnic, family(binomial) link(logit)
   glm abs class ethnic, family(binomial) link(probit)

4. glm abs class ethnic, family(binomial) link(logit) ///
       vce(robust)

   glm abs class ethnic, family(binomial) link(probit) ///
       vce(robust)

   bootstrap _b[class] _b[ethnic], reps(500): ///
       glm abs class ethnic, family(binomial) link(logit)

   bootstrap _b[class] _b[ethnic], reps(500): ///
       glm abs class ethnic, family(binomial) link(probit)

Chapter 8

8.1 Treatment of post-natal depression

1. infile subj group pre dep1 dep2 dep3 dep4 dep5 dep6 ///
       using depress.dat, clear
   mvdecode _all, mv(-9)
   graph box dep1-dep6, by(group)

2. We can obtain the mean over visits for subjects with complete data using the simple command (data in "wide" form)

   generate av2 = (dep1+dep2+dep3+dep4+dep5+dep6)/6

   For subjects with missing data, av2 will be missing whereas the egen function rowmean() would return the mean of all available data. The t-tests are obtained using

   ttest av2, by(group)
   ttest av2, by(group) unequal

3. egen max = rowmax(dep1-dep6)
   ttest max, by(group)

4. a. generate diff = avg - pre
      ttest diff, by(group)

   b. anova avg group pre, continuous(pre)

Chapter 9

9.1 Thought disorder and schizophrenia

1. use madras, clear
   reshape long y, i(id) j(month)
   label variable month ///
       "Number of months since hospitalization"
   generate month_early = month*early
   label define e 0 "Late onset" 1 "Early onset"
   label values early e

   gllamm y month early month_early, i(id) ///
       link(logit) family(binom) eform adapt

   gllapred prob1, mu
   sort id month
   twoway (line prob1 month, connect(ascending)), ///
       by(early) ytitle(Predicted probability)

   generate cons = 1
   eq slope: month
   eq inter: cons
   gllamm y month early month_early, i(id) nrf(2) ///
       eqs(inter slope) link(logit) family(binom) ///
       eform adapt

   gllapred prob2, mu
   sort id month
   twoway (line prob2 month, connect(ascending)), ///
       by(early) ytitle(Predicted probability)

9.2 Australian school children

1. use ..\data\quine, clear
   encode eth, gen(ethnic)
   drop eth
   encode sex, gen(gender)
   drop sex
   encode age, gen(class)
   drop age
   encode lrn, gen(slow)
   drop lrn

   generate id=_n
   gllamm days class ethnic, i(id) adapt ///
       family(poisson) link(log)

Chapter 10

10.1 Treatment of post-natal depression

1. a. infile subj group dep0 dep1 dep2 dep3 ///
          dep4 dep5 dep6 using depress.dat, clear
      mvdecode _all, mv(-9)
      reshape long dep, i(subj) j(visit)
      generate gr_vis = group*visit
      xtgee dep group visit gr_vis, i(subj) cor(exch)
      regress dep group visit gr_vis, vce(robust) ///
          cluster(subj)

   b. bootstrap _b[group] _b[visit] _b[gr_vis], ///
          cluster(subj) reps(500): ///
          regress dep group visit gr_vis

Chapter 11

11.1 Estrogens and endometrial cancer

1. infile v1-v2 using estrogen.dat, clear
   generate _varname = cond(_n==1, "ncases1", "ncases0")
   xpose, clear
   generate conestr = 2-_n
   reshape long ncases, i(conestr) j(casestr)
   expand ncases
   sort casestr conestr
   generate caseid = _n
   expand 2
   by caseid, sort: generate control = _n - 1
   * (dummy for control: 1=cont., 0=case)
   generate estr = 0
   replace estr = 1 if control==0&casestr==1
   replace estr = 1 if control==1&conestr==1
   generate cancer = cond(control==0,1,0)
   preserve
   collapse (sum) estr (mean) casestr, by(caseid)
   generate conestr = estr - casestr
   tabulate casestr conestr
   restore

   clogit cancer estr, group(caseid) or

11.2 Low energy diet and heart disease

1. infile str5 age num1 py1 num0 py0 using ihd.dat, clear
   generate agegr = _n
   reshape long num py, i(agegr) j(exposed)

   table exposed, contents(sum num sum py)
   iri 28 17 1857.5 2788.9

2. Keeping the data from the previous exercise

   xi: poisson num i.age*exposed, exposure(py) irr
   testparm _IageX*

   The interaction is not statistically significant at the 5% level.

Chapter 12

12.1 Retention of heroin addicts in methadone maintenance treatment

1. We consider anyone still at risk after 450 days as being censored at 450 days and therefore need to make the appropriate changes to status and time before running Cox regression.

   use heroin, clear
   replace status = 0 if time>450
   replace time = 450 if time>450
   egen zdose = std(dose)
   stset time status
   stcox zdose prison clinic

2. The model is fitted using

   generate dosecat = 0 if dose<.
   replace dosecat = 1 if dose>=60 & dose<.
   replace dosecat = 2 if dose>=80 & dose<.
   xi: stcox i.dosecat i.prison i.clinic, bases(s)

   The survival curves for no prison record, clinic 1 are obtained and plotted using

   generate s0 = s if dosecat==0
   generate s1 = s^(exp(_b[_Idosecat_1])) if dosecat==1
   generate s2 = s^(exp(_b[_Idosecat_2])) if dosecat==2
   sort time
   graph twoway (line s0 time, connect(stairstep)) ///
       (line s1 time, connect(stairstep) lpat(dash)) ///
       (line s2 time, connect(stairstep) lpat(dot)), ///
       legend(order(1 "<60" 2 "60-79" 3 ">=80"))

   Note that the baseline survival curve is the survival curve for someone whose covariates are all zero. If we had used clinic instead of i.clinic above, this would have been meaningless; we would have had to raise s0, s1, and s2 to the power exp(_b[clinic]) to calculate the survival curves for clinic 1.

3. Treating dose as continuous:

   generate clindose = clinic*zdose
   stcox zdose clinic clindose prison

   xi: stcox i.dosecat*i.clinic i.prison
   testparm _IdosX*

4. xi: stcox i.dosecat i.prison i.clinic
   xi: stcox i.dosecat i.prison i.clinic, efron
   xi: stcox i.dosecat i.prison i.clinic, exactm

   It makes almost no difference which method is used.

5. stcox zdose prison, strata(clinic) tvc(prison) ///
       texp((_t-504)/365.25)

   estimates store model1
   quietly stcox zdose prison, strata(clinic)
   lrtest model1 .

Page 326: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Chapter 13 13.1 Two-component mixture of normah

2. nlcom &dl: exp( [ l sd i l [-cans] ) J / / (sd2 : exp( [led21 [-cons] ) ) /// (p: invlogit I [la 11 L-cons1 1 1

giving mtimatttes (standard errors) 6-66 (0.82) 7.06 (1.81) for the standard deviations and 0.74 (0.07) for the prubabilits

13.2 Heteroscedastic linear regression

I . The only thing that i s diflcrcnt from fitting a fiornml distribution w-ith conslant mean is that the mean is now a lmrm function of status scl that the m l model command changes as shown below:

infile cond status resp using slim.dat, clear ml model If mixing1 (xb: resp = status) / l e d mX maximize, noheader

2. In linear rebg-ession. the mean s q i d error is q u d to the sum of squares d i ~ i d d IF the degecs of freedom, n - 2. Thhc maximum likelihood ~ t i r n a t e is e q i d to the stun of squares di l idd by TL. 1% can thcrcforr pt the root memi squaw error for linear regresqion using

disp exp (Clsdl C-cons1 )*sqrt(e(M)/(e(N)-2) 1

Note that the standard errors or t.he r~grei5ion ~ o ~ c i c n t s need to be wrrected by the same factor. i.e..

d i sp [xb] -se [status] *s- (e (N) /(e(N)-2) )

Compare this wit,h t,hc rcst~lt. of

regress resp status

3. hpeimt the proccduw abow but replace thc ml model command by

MI model If mixing1 (resp = status) (lsd: status)

Thc effect of status or& thc standard deriatiur~ i s significant (JI = 0.003) which is not too dimerent h m the result of

gdtest resp, bylstatus)

13.3 Three-component mixture of normals

1.  capture program drop mixing3
    program mixing3
        version 9.2
        args lj xb1 xb2 xb3 lo1 lo2 ls1 ls2 ls3
        tempvar f1 f2 f3 p1 p2 p3 s1 s2 s3 d
        quietly {
            generate double `s1' = exp(`ls1')
            generate double `s2' = exp(`ls2')
            generate double `s3' = exp(`ls3')
            generate double `d' = 1 + exp(`lo1') + exp(`lo2')
            generate double `p1' = 1/`d'
            generate double `p2' = exp(`lo1')/`d'
            generate double `p3' = exp(`lo2')/`d'
            generate double `f1' = normalden($ML_y1,`xb1',`s1')
            generate double `f2' = normalden($ML_y1,`xb2',`s2')
            generate double `f3' = normalden($ML_y1,`xb3',`s3')
            replace `lj' = ln(`p1'*`f1' ///
                + `p2'*`f2' + `p3'*`f3')
        }
    end

    clear
    set obs 300
    set seed 12345678
    generate z = uniform()
    generate y = invnormal(uniform())
    replace y = y + 5 if z<1/3
    replace y = y + 10 if z<2/3&z>=1/3
    ml model lf mixing3 (xb1: y=) ///
        /xb2 /xb3 /lo1 /lo2 /lsd1 /lsd2 /lsd3
    ml init 0 5 10 0 0 0 0 0, copy
    ml maximize, noheader trace
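The fitted parameters of the three-component model are on the log and multinomial-logit scales, so it may help to transform them back after ml maximize, in the same spirit as the nlcom call in 13.1. The following is only a sketch: it assumes the equation names lo1, lo2, and lsd1 to lsd3 from the ml model command above, and the labels sd1 to sd3 and p1 to p3 are illustrative names, not part of the original solution.

```stata
* Back-transform the log standard deviations (delta-method standard errors)
nlcom (sd1: exp([lsd1]_b[_cons]))  ///
      (sd2: exp([lsd2]_b[_cons]))  ///
      (sd3: exp([lsd3]_b[_cons]))

* Recover the three mixing probabilities from the two log-odds parameters
nlcom (p1: 1/(1 + exp([lo1]_b[_cons]) + exp([lo2]_b[_cons])))                    ///
      (p2: exp([lo1]_b[_cons])/(1 + exp([lo1]_b[_cons]) + exp([lo2]_b[_cons])))  ///
      (p3: exp([lo2]_b[_cons])/(1 + exp([lo1]_b[_cons]) + exp([lo2]_b[_cons])))
```

Since the data were simulated with unit standard deviations and roughly equal thirds in each component, the back-transformed estimates would be expected to lie near 1 and 1/3 respectively, up to sampling error.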

Chapter 14

14.1 Hearing measurement using an audiometer

1.  infile id l500 l1000 l2000 l4000 r500 r1000 ///
        r2000 r4000 using hear.dat, clear
    pca l500-r4000, cov
    predict npc1 npc2, score
    twoway scatter npc2 npc1, mlabel(id)

3.  capture drop pc*
    pca l500-r4000
    predict pc1-pc5, score
    graph matrix pc1-pc5

14.2 Determinants of pollution in U.S. cities

1.  infile str10 town so2 temp manuf pop wind precip ///
        days using usair.dat, clear
    pca temp manuf pop wind precip days
    predict pc1 pc2, score
    scatter pc2 pc1, mlabel(town) mlabpos(0) msymbol(i)

2.  regress so2 pc1 pc2
    generate regline = pc1*_b[pc2]/_b[pc1]
    twoway (line regline pc1) ///
        (scatter pc2 pc1, mlabel(town) mlabpos(0) msymbol(i))
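One way to read the regline variable in item 2 is as the direction in the plane of the first two components along which the fitted so2 changes fastest. The symbols b0, b1, and b2 below are shorthand introduced here for the estimated intercept and the coefficients _b[pc1] and _b[pc2]; this is an interpretive sketch rather than part of the original solution.

```latex
\widehat{\mathrm{so2}} = b_0 + b_1\,\mathrm{pc1} + b_2\,\mathrm{pc2},
\qquad
\nabla\,\widehat{\mathrm{so2}} = (b_1,\, b_2)
\quad\Longrightarrow\quad
\mathrm{pc2} = \frac{b_2}{b_1}\,\mathrm{pc1}.
```

The line through the origin with slope b2/b1 is exactly what generate regline computes, so cities lying further along it have higher predicted so2 (or lower, depending on the signs of the coefficients).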

Chapter 15

15.1 Tibetan skulls

1.  infile y1 y2 y3 y4 y5 using tibetan.dat, clear
    cluster singlelinkage y1-y5, name(sl) manhattan
    cluster completelinkage y1-y5, name(cl) manhattan
    cluster averagelinkage y1-y5, name(al) manhattan

Then plot dendrograms, etc.

15.2 Determinants of pollution in U.S. cities

1.  infile str10 town so2 temp manuf pop wind precip ///
        days using usair.dat, clear
    foreach v of varlist temp manuf pop wind precip days {
        egen s`v' = std(`v')
    }
    drop if town=="Chicago"

Repeating the analysis from pages 308 to 309 (with the same random number seed) leads to a 5-cluster solution, the former clusters 1 and 4 being split up and cluster 5 nearly remaining intact.

2.  cluster kmeans stemp smanuf spop swind sprecip ///
        sdays, k(5) name(cluster5) start(segments)

The start(segments) option splits the sample into k (here 5) equal groups and uses the means as starting values.

3.  cluster kmedians stemp smanuf spop swind sprecip ///
        sdays, k(5) name(cluster5)


Index

accsssing results, 24, 38 .b Is~nmamel, 2.1 e 0 , 36 uIb), 24 e(chiPAls), 216 e ldevlance), 14U e ( d i s p e r ~ p ~ p s ) , 149 e l l l ) , 275 [ugnrsnid -b Iwmapncl , 269 Iwpnrnel -se [ v a m t n ~ l , 269 r 0 . 36 r(meao), 36 r(Var), W'T

adaptlvr qundratnw, 177 adjust~d ti2. 67 addde, 38 algcbra~c rxpmsion, I G tlnral?*;ls of wim~llrw. 8549 - lIl l AYCOVA, see at~nlys~s or rovarinnce ANOVA< see andj.s~s of variance a~ltoregrwslve structirre, 209 werage l ~ n h ~ c , 297 aweights, 23

chi-sq~rued test. 40, 51 clnssiIi<:ation table, 127 closir~g Statn, 8 clust~r andpis, 295-313 n~hort studlex. 222 collinearity, 669 comulnnrl

adopath. 38 aaoua. 81. iU4, 311

regress optron, 107 ssqumtial oplrun. 105

assert. 47 blogzt, 123 bootstrap, see prefix command, bootstrap by, see prefix C O ~ u n ~ n d . + capture. PW p r ~ h mm~naud capture cc, 236 sci, 220 clear, 10 clogit, 235 clustsr averagsllnkage, 298 cluster completelinkage. 298 cluster generate 803 cluster beans. 308

start 0 tiption, 3fl8 cluster sisglel~nkage. 298 codebook, 47 coLLapse, 22, 154, 16% 236 correlate, 54, 67. 286 cprplat. 79 decode. 15 dentring, 1'3 display, 13. 14, 30. 122

33.5

Page 338: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

do, 35, 37 drop, 20,22, tlte egen, 18, 20, 50,167, 248,308 encode, 13, 14d epitab, 228 estat classlf, 127

cutoff 0 option, 127 escat gof, 120 estat phtwt, 255 esta t v i f , GO astat wcorrelation. 207

format 0 option, 207 estimates store, 118. 253 O X l C . Bli

clear c~ption, 9 expand, 22. 125, 233,278 foreach, 48, 74, iT, 308 format, I% f orvalues. :103 generate, 19 gllam, 40, IR3, 185, 192

adapt oplion, 184 ef orn option, 192 e q s 0 option, 183 n i p 0 opt~on, 192 nrf 0 option, 193

gllapred, 185. 192 marg optitm, 195 mu option, 195 u option, 185 u s 0 option, 195

glm. 133, 130, 230 efom option, 148, 2:1(1 family0 option, 139 lmkl) optlon, 139 li&(recfprocalJ option, 155 offset0 option, 155, 230 scale(x2) option, 147 vcelrobust) option. 139

g h , 131) 153 global, .7R graph, 24-30

legend0 option. 26 graph bar

asyvars optirni. 53 legend (off) rBpt luu, 53 ever I) option, 2R, 63 s h o w p u s optinn, 5:I

graph box, 49, IVY, 217

o v e r 0 option, 28, 16'3: 1154 relabel0 optirm, 164 ylineE0) option, 49

graph matrix, 56,64 , 161,284,2'48, 303

j i t t e r0 option. 5.5 mlabel I ) optrnn, R5 mlabposition0 option. 65

graph twovay, 25,56,123,180,212, 249

by0 option, 28 legend0 option. 27 xtitls0 option. 26 27 ylabelo o p t i o ~ ~ . 1SR ytitle0 option. 28, 27

graph tuouay connected connect (ascending) optio~~, 163

graph tuoray function, 124 graph twoway kdensity. 273 graph tuovay llne

connect<ascending) option. 16'2 , corrnect (stairstep) uption. 249

graph troway mspllne, 124 graph twovay rarea, 1M graph twoway scatter. 1411, 284 Ipatt 0 option, 27 mlabel0 optmn, 77 msymbol0 option, 27

help, 8 help uhatsnea, 39 histogram, 273 us, 207 mflle, 11, 64, f%, 104 116. 138.

228,284, 298 insheet. 11,47, 56 ir. 228 xi. 2% kbenslty, 188, 273 keep, 20, 22 ktau, 56 label def ins. 12 label values, 12. 89. 178 label variable, 11 I£, 267 lincom, 179 list, 12, 18, lK2 clean rrption. 162 noheader npt~on. 303 noobs option, 303

Page 339: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

separator(0) option, 303 local, 36. 140 108, 7 logistic, 118 legit, 117 or L I ~ ~ I O I I . 1 18

lookfor, 15 Iroc, 127 lrtest. 110,144, 253 laens. ln matrlx. 31 acci, 281, 232 memory, 11 merge. 22 Ilkmat. 292 ml, 39, 16.1 ml in~t, 271 ml maximize. 'Mi7

noheader uption, 268 trace rq)tir)n, 268

ml model. 267 more, 75 mvdecode. 12. 1 78, 206 mvencob, 12 nbrsg, 133 net cd, 40 net from. 40 net lnscall, 40 nlcom, 272 odbc, I I ologit, 120 outf ile, 11 outsheet, 11 pca, Xrl

covariance option, 287 pcamat, 292

m e s (1 option, 292 pnorm, h poisson. 2:311

exposure 0 option, 230 irr optiou, 2:50

predict, 23, 77, 118,122. 128.149: 180, 181, 216, 2.M, 289

cooksd oplior~. '79 f i t t e d gfit;ton, 183 number r ~ r o n , 128 pearson (@ion, 149 pr option, 119, 124 reifecte option, 183

rstandard option. 77 acore opt~on. 289

preeerue. 22. 162. 295 program, 38 program define, 37, 266 program drop, 38, 208 pvcorr. 54

obs nphon. 51 ~ 1 g option, 54

qnorm, 80 quietly, sea pre6x command, quietly recode, 12, 20, 48, lu6

generate 0 option, 117 regross, 66, 95, 106 raane, I 1 mplace, 10, 19 reshape, 20 21, 88, 116, 162, 178,

190,206. 210, ?a, ZYd, 245 i 0 opt~on. 89

restore, 22, 162. 235 robvar, 5U rvfplot, 74 rvpplot, 74 ampsi, 30 save, 9 scalar, 266 scoreplot. 289 screaplot, 287 search, 8 $errbar, 1A5 set memory, 10 set mora off , 30 set obs, 266 set scheme, 29 sat eeed, 131, 266 sort, 17,163,233, a m ssc install, 40, I83 statsby, wee prdx rmmmand, stataby atcox, 246

basecbazardfi option, 250 bases0 optlurl, 218 usr clptlon, 257 wale l) opt~wl, 2.54 strata0 optlun, 218 t@xpO option, 232 t v c 0 option, 252

stepwise, see prcfix m~ruunnrl. stepwi8e stjola, 255 etphplot, 251

Page 340: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

s t r e t , 245, 254 stsplit, 254 atmm, 213 summarize, 30, 159, 210, 2 R 4 syntgx, 38 sysuse, 10 table, 89, 108, 116, 119, 145, 2n,

803 contents0 rrption, 89 f ormat0 optboo, 145

cabstat, 48, :1IM statiBtics0 option, 48

tabulate, 61, 104, 235 exact option, 52 expected uplion, 52 nofreq option, 52 row opllon, 52

tempname, 268 tempvar, 268 tsatparm, 95. 143 t i a , 2117 ttest, 51, 167

b y 0 option, 187 unequal optiot~, 51, lril

ttesti, 31 taouay, see commxlrc2, graph twoway update a l l , 3'3 use, 9, Ill, 1 14,210

clear up ti or^, 10 vsrsion 9.2.35 xi. are: prcfix commalid, xi xpore, 22,233 xtdaa, 162 ages, 205, 206, 209, 218,216

corr(exchangeable) optior!, 207 correlationi) option. 2 M correlation(unstructured) op-

lion, 209 ef om opt~ou, 2 18 famxlyi) option, 205 10 r~ption, 207 link0 (>))tian, 205 scale lx2) option, 214 t 0 option, 2U7 vce(robust) uptrm, 215

xtlogit, 1%) intpoints0 i~ption, 192

xtmixed, 180

ccvaxiance(unstroctuled) op- tion, 187

mls option, 181 nocons ofl.wri, IHl

xtreg, 178 fe optibn, 200

cmmcntirig ont I~nw. 35 cunplrlc I i r ~ M e , 297 wlripounrl s.yrr~roctry, 2W tm>ditiond Iikclihod, 224 rot~ditiwol I(>gistic. regrpwioo, 227, 232.

235

azilo p~>llut.io~~ liltrr 11olrr. 06 rlolting Ilmm of hklod, 155 crowd wmctlr!ns to tbrcuteri~d xui-

cidr, flu rietcrm~nnm of pollution in U.S.

cit.ips, 81, 82, 292, 308, 312 dimg[lwi% oI hvnlt ntt.wkg, 111,130 driver crlucatlnr~, 210 duretion of LJN p c m b p i n g mr*

siow, XU

efficimty of bycling, 98; cpilytic seiztrws nnrl chemother-

nm, 200, 201, 218 mtroaaus and endometrial cancer.

2313 cxlrwemion uid CHT care, 82 Icmnle psychiatric patients, 4.3, 55.

129 I h~aring mcahurcma;ril using on nu-

diameter, 261, 281 111ducd nh~rl lo~i arid ectopic preg-

nancy. 237

Page 341: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

inwion of acacia trm by ants, 51) jaw gowth. 171, 1Y9 lifc expectancies. 313 lrm slerm diet and heart disease.,

236 ntatmurrl 1d1avior ilr rats, 99 nlortality frtr~n skio cancer. 5H, 83 Ncur Yod Times d c a t l ~ notices, 278 ord cnntrwxptive use and qyocar-

did iidmtion, 236 prstato t!nom. 131 psyd~ialric tzrePuin~ data. 1.70 retention of h m i n arldicts i r ~ methado

n~aintcrian& treatment. 239. 258

~ L t a k i r i g in young children, 109 Iloma~~Briritih potmy, 312 satinfac:tian with housing conditions.

131 scnial satisfaction, 5rJ snrvival of paticntv with primary

hiliary cirrbmis. 254 syxtoiir Llwd pressure: IN Ihougl~t disorder and schiaophw

~da . 17.7, 199, 219 Tibclan xkulls, 295, 311 lmting hy~mtension, 85 tnantrnwl ef Alahcimcr's, 172 trminwlt of lung cmccr, 111,129 twatnwnt of wst-natd d c p w i o n ,

157, 170: 218 t.rentluanl of prartatr: Cancer, 260 w a s i n c r ~ s : 41, 171. 199 wl*t.cr hardness. 83 wave damage to c a r p slips, I 54

data browser, 4 d a t ~ editor, 4 data management, 13- 22 date format, 12 delta methurl, 118, 272 d*ndro~am, 297 deviw~ne, 197 de\-inncc rmirluals. '256 dichotomous, 46 dictionary file, 87 d.crfile, 34, 35 dc-file editor. 0. 35 rluuble, 3% sbragd t ~ p ~ , Jouthc dnmmy varltblm. 1 Od

esen Function g m p 0 , 128 romean0, 20, 107 seq0, 116 stdo, 248, 808 tag0, 128 total(), 2U

aigenduh, 284 nuptnral Rayes, 181 cpldemrology. 221 cqrlalnn, 267. 268 cHimxtion ctnnrnand, 22- 23

111 anwa, s e ct>rnmnod, anova clogit, sce cornmxnrI, clogit g l l a , sw nnnmand, g l l a m glm, r;ec r:ontmand, glm logistic. 8a comma~~d, logistic logit, see cr)mmaod, l o g i t o l o g ~ t . rpe c:ornmand, o l o g i t peissm, svz c~~mmmd, po~sson regress, q c cornrnand, regress B ~ C O X , 5m con~rnand, stcox xtrmred, srr! cotrtrnand, xtmixed xtgee, .we wmn~and, xtgee x t l o g i t , spe command. x t l o g ~ t xtreg, see command, xtreg

Euclidean d~stanw, 3 7

F-tpsl, 103 FAQ, 2 Lillrlr! niixturc d~slribution, 28.3 Fisller's rxlrrt tpst, $2 limvnrd wlacttcan. 71 frailty, 151 f~likctian

chi2tai10, 16, 144, 276 cond0, 2U, 270 da te0 , 13 expo. 16, 269 FtailO, 142 invloglt 0, 125,273 fnvnormalO, 16, ZBG InO, 273 hfactorialO, 278 logo, 16.212 normal 0 ., 16 nomaldeoO,2M s c a l a r O , 2H6 sqrt 0. IG, 289

Page 342: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

CEE, see generalized ~sti111atiq euun- t i n ~ ~ s

genaaljwd wLinlatir~ equations, 206 pteritlizd l~near mixed ~fiorlel, 177 ~~rtwallzetl linear rnrxlcl. 133 153 gluM nnlacro, 267 grhphics. sm corninand. g a p h

h a d fur~ction, 242 hanard ratlu, 243 11clp, 1 hclp BIP, 8 himarrhicrrl surns ol sqws. see s~yueti-

tial mnls of squatex hisMgnm, 273

i~bruediatc! cmunand, -70 cci, sr*: a>rnnlanrl. cci iri, s e ~ cr~mrnand, iri mcci, a& r.ornlnnr~d. mcci sampai, .9w command, sampsi ttesti. ~lw cornrnaud, ttestl

inrida~ice raw. 224 ilnlc:ider~cc rata r;rt,io, 221 indexed variable, 16 informntion ~natriw, 264 initid vraiuw. 273 intexaction:'% imraction diwarns, 91 intern1 a:alc, 46 introclass correhtion, 176

l e s t qrrarm, 63 lcaw on@ uut metl~url. 128 likrlihuad ratin, 117 lined predictor, 134, 226, 268 linear repasion, am mult~ple rcgrmlon link funrt~on,ns, 134 lrlcol macro, 18, 81, 14U, 286 Iq filc, 5

log iikcli11od. 264 log odds, 270 log trmsformatirm, 93- 212 lagfcnl cXptwim1. 15 Iugstlc rtrns iun, 111- 129, 227 logit, F 13 Ir)rgitudkl data, 157-172, 201 Irropirlg, 17 lorvesx, 79

roai~~ effccts, 86 hlunn-h'h~lnry Cr- tmt, 46 Mantel-Hncn.wl estimate, 228 matched t -control stild~*. 226,231 mdchiug, 226 rrmtrix, 31 ntnximum lihlihooticr;Lim~llorl, 204.224.

283-276 MrNcmar's tmc, '22'7 i n m pmfilw, 164 rntszi~lg ~aluas. 158 rnidhlevcl, 177 niull~ple crmlatior~ cocficimt. ti4, 67 nlultiplr r egmion , ti1 82 mn111tiv;trmtu cl~la, 283

ucgptive hinonlil. 1VY N~tCo~lmrs, 2 Uewto-lhphmn d ~ c s i t h m , 2M ~tonnal probability plot, 77

odds, I13 wlds ratiu, 118, 225 (dl&, 213. 225 orduml, 46 ordlnrrl Iodstir: rcgmwon. 114 wm~i~yrm~r~n, 137. 147. 153, 214. 215

pairuvim corr~lntiun, 54 p~rtial log liketihood, 244 partla1 residual plnt. 7U PCR, st.c prmr i p d wnipmmts analysis P P M O ~ x', 237, 147 F e x c m ~ correlnt~on, 47 Pcsrsor~ rmidud. 125. 21 B pwwn-nrnp of olwri~~t~on, 224 plot. srr: c~mmalrd. graph phq-in, 34 Klirm diskribuhon, 138. 137, 214, 213.

224

Page 343: 84656216-A-Handbook-of-Statistical-Analyses-Using-Stata-Ver9-4Th-Ed-2007.pdf

Foiswn remnsior~. 225, 228 pcst-tstitimnt~on commniul, 2 S 2 4

predict, A@@ co~nlriand, estat estimates, aee co~u~uaud, estimates esta t , scc commar~d. estat @lapred, see C O ~ I I I ~ I I ~ . gllapred lincom. ?r~m m m a n d . lincom lrtest. see command. lrtest nlcom, spn rr)mmnnd. nlcom F e d i c t , sw commnnd, prsdlct t e a t , s e commnnd, t e s t testparm. see mmmand, tsstpam

prcfix command, 14 bootstrap, 151

reps0 option, 151 by, 14, 17, 128, 233

sort option, 17, 128 capture, 35, 38, ZW quietly, 35, 266, 271 statsby, 169 stepwise, 72,73

p e 0 oplio11, 72 pro cqt.ttm, 73

Xl, 95 principal components analysis, 281-291 pvofilc likclihwd. 244 p~vgmnming, 34-39. 265-280 proporl~onal hazards, 243 proportrunal odds model, rm urrlinnl lrr

gistic r e v i o n pweighta, 23

as plot, .W quasi-likelihood, 138, 147, Md

r u t d m effecls, 151 rarrdom effects model. 137, 173- 199 random intercept rndel, 175 rendi~lg dela, 9

infile, RW wmmmd, inf i l e insheet, see cornrnuncl, inxheet use, see mmrnnnd, use

regreasion uonditimd logistic, see wnditiond

lo~lstic rcprwslon I~oear, see multiple regmssion lq i s t~c , s e ~ logistic rqtwsion ncgntivc binomid, see ncwtivc bi-

nom~al reflesion

orrlis~nl Icgkdic, see ordi~irlxl iqq$tic r ~ p - % i < m

Poiswai, spa P o i m rcp;&m rcpmion mcficicnt, 63, ti7 rcgmmon diqqcmtia, 77-82 rclatiw: risk, 224 REML, see restr~cled rriaxi~ui~rn likeli-

11ood midnds, 73, 75, 125.148, 256-257 response r e a L w aumlysis 167 response proliles, 164, 211 rigbtceusored, 241 robust y ~ m d m d errors, 138: 149 R O G L U ~ ~ P . 127

saving data, B
    outfile, see command, outfile
    outsheet, see command, outsheet
    save, see command, save
scatterplot matrix, 55, 64, 161, 284
scree plot, 287
search, 7
sensitivity, 127
sequential sums of squares, 1011, 1US
single linkage, 297
SJ, see Stata Journal
Spearman rank correlation, 17
specificity, 127
SPI, see plug-in
SSC, see Statistical Software Components
standard error, 207
standardized residual, 77, 126
Stata Journal, 39
Stata Technical Bulletin, 39
Stata web page, 2
Statistical Software Components, 39
STB, see Stata Technical Bulletin
stepwise regression, 71
stopping Stata, 8
storage type, 11
    double, 267
stratified Cox model, 244
survival analysis, 3 9
survival data, 241
survivor function, 242


two-way table, 51
Type I sums of squares, see sequential sums of squares
Type III sums of squares, see unique sums of squares

unique sums of squares, 104
updating Stata, 39-40

weights, 14, 23, 143


A Handbook of Statistical Analyses Using Stata

Sophia Rabe-Hesketh and Brian S. Everitt

With each new release of Stata, a comprehensive resource is needed to highlight the improvements as well as discuss the fundamentals of the software. Fulfilling this need, A Handbook of Statistical Analyses Using Stata, Fourth Edition has been fully updated to provide an introduction to Stata version 9. This edition covers many new features of Stata, including a new command for mixed models and a new matrix language.

Each chapter describes the analysis appropriate for a particular application, focusing on the medical, social, and behavioral fields. The authors begin each chapter with descriptions of the data and the statistical techniques to be used. The methods covered include descriptive statistics, simple tests, analysis of variance, multiple linear regression, logistic regression, generalized linear models, survival analysis, random effects models, and cluster analysis. The core of the book centers on how to use Stata to perform analyses and how to interpret the results. The chapters conclude with several exercises based on data sets from different disciplines.

Features

Demonstrates how a wide variety of statistical analyses, including descriptive statistics, graphics, model estimation and diagnostics, can be performed using Stata

Features the new mixed-models estimation command xtmixed and the new matrix language Mata

Incorporates numerous examples that use real-life data to explain the application of the methods described

Includes diverse data sets in many exercises as well as selected solutions in the appendix to build a better understanding of the methodology

Sophia Rabe-Hesketh is Professor of Educational Statistics at the Graduate School of Education at the University of California, Berkeley, USA, and Chair of Social Statistics at the Institute of Education, University of London, UK.

Brian S. Everitt is Emeritus Professor of Statistics at the Institute of Psychiatry, King's College, University of London, UK.

Chapman & Hall/CRC, Taylor & Francis Group, an informa business
