Being Productive with Stata and VA Data

Give me six hours to chop down a tree and I will spend the first four sharpening the axe.
--Abraham Lincoln

Todd Wagner
August 2008

Outline

Database manipulation in Stata
Data Analysis in Stata

Working Interactively and .do files

You can issue commands directly into the command line.
Unless you save your commands into a batch file (a .do file), you’ll lose your code once you close Stata.
I often work interactively and then save the “right” commands in a .do file.

Editing a .do file in Stata

Any ASCII text editor will work
Stata has a built-in text editor, but it is limited.
I recommend exploring your options
http://fmwww.bc.edu/repec/bocode/t/textEditors.html

How Stata Thinks about Data

Handling Data

SAS processes one record at a time
Stata processes all the records at the same time
– Loops are commonly used in SAS
– Loops are very rarely used in Stata

Loading Data into Memory

Stata reads the data into memory
– set mem 100m (before you load the data)
You must have enough memory for your dataset
With large datasets:
– drop unnecessary variables
– Use the compress command (but don’t compress SCRSSN)

Stata Abbreviations

Many Stata commands can be abbreviated, often to the first three letters
– regress income education female
could be written
– reg income education female

Stata Help

Stata’s built-in help is great
– help <command>
Stata manuals are great because they review theory

Stata and the Web

Stata is “web aware”
Check for updates periodically
– update all
You can search for user-written programs
– findit output
– findit outreg (click to install)

Stata in Windows

Page Up scrolls through the previous commands
There is a graphical user interface (menus) if you forget a command
In Unix, you can use all of Stata’s functionality if you use X Windows (e.g., Cygwin).

Sysdir, ls and cd

Stata recognizes some Unix commands, such as ls and cd
Sysdir provides a listing of the directories Stata uses

sysdir
   STATA:  C:\Program Files\Stata9\
 UPDATES:  C:\Program Files\Stata9\ado\updates\
    BASE:  C:\Program Files\Stata9\ado\base\
    SITE:  C:\Program Files\Stata9\ado\site\
    PLUS:  c:\ado\stbplus\
PERSONAL:  c:\ado\personal\
OLDPLACE:  c:\ado\

Store your data on a VA server, not on your PC or laptop!

Delimiters

SAS recognizes “;” as a delimiter
Stata recognizes the carriage return
– Always add a carriage return after your last command
You can change the delimiter to ; (in a .do file)
#delimit ;
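As a sketch of how the alternative delimiter reads in practice, the .do file fragment below splits one command across several lines. It assumes the auto example dataset that ships with Stata (loaded with sysuse), which is not part of the original slides.

```stata
* Run this from a .do file: #delimit does not work interactively.
* Between the two #delimit lines every command ends at the semicolon,
* so the regress command can span several lines.
#delimit ;
sysuse auto, clear ;
regress price
    mpg
    weight ;
#delimit cr
* Back to the carriage return as the delimiter
display e(N)
```

Comments are kept outside the semicolon-delimited region on purpose, since a comment that starts with * also runs until the next semicolon there.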

Missing Data

Stata and SAS both use “.” as missing
Stata implicitly values a missing as a very large number
SAS implicitly values a missing as a very small number
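Because Stata treats missing as larger than any number, a condition such as age > 65 is also true when age is missing. A minimal sketch with made-up data (the variable names are illustrative and not taken from the VA files):

```stata
clear
input age
70
40
.
end

* Naive flag: the missing record is flagged too, because . > 65 is true
gen old_naive = age > 65

* Safer flag: stays missing when age is missing
gen old = age > 65 if !missing(age)

list
```

The same caution applies to any "greater than" condition used in if, keep, or drop.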

Generating and Recoding Variables

In SAS you type
quality=0;
If VA=1 then quality=1;
In Stata you type
gen quality=0
recode quality 0=1 if VA==1
or
replace quality=1 if VA==1

Boolean Logic

Stata is picky about Boolean logic
gen y=x if a==b (must use two ==)
gen y=x if a>b & b>10 (must use &)
gen y=x if a<=b (< or > must be before =)

Creating Dummy Variables

Goal: create dummy variable for gender
gen male=sex=="M"
tab sex, gen(sex_)
This second command automatically creates 2 dummy variables
Be careful about missing data: missing data are assigned to 0, unless you use “if” or “recode”
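The warning about missing data can be made concrete with a small made-up example. The sketch below assumes sex is a string variable where an empty string means unknown; the variable names male_naive and male are hypothetical.

```stata
clear
input str1 sex
"M"
"F"
""
end

* Unknown sex becomes 0 here, so the record looks like "not male"
gen male_naive = sex=="M"

* Adding an if qualifier keeps the dummy missing when sex is unknown
gen male = sex=="M" if sex!=""

* tab with gen() creates sex_1 and sex_2, one per nonmissing category
tab sex, gen(sex_)

list
```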

Drop

drop <varnames> (drops variables)
drop if X==1 (drops cases where the value is 1)

egen Commands

You want to generate total costs for a medical center
In SAS this is done by proc summary
In Stata, you can type
collapse (sum) cost, by(sta3n)
or
sort sta3n
by sta3n: egen sumcost=total(cost)
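The two approaches give the same totals but leave different datasets behind: egen keeps every record and attaches the total to each one, while collapse replaces the data in memory with one record per group. A small sketch with invented station numbers and costs:

```stata
clear
input sta3n cost
101 10
101 20
202 5
end

* egen keeps all three records and adds the station total to each
sort sta3n
by sta3n: egen sumcost = total(cost)
list

* collapse leaves one record per station, with cost now holding the sum
collapse (sum) cost, by(sta3n)
list
```

Save the data first if you still need the record-level file, because collapse discards it from memory.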

ICD-9 Codes

Stata has capabilities to handle ICD-9 diagnosis and procedure codes (the icd9 and icd9p commands)
You can
– check to see if codes are valid
– generate identifiers based on codes or ranges of codes

Dates

Stata has many of the same date functions as SAS (e.g., mdy(), year(), month(), day())
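Stata stores a date as the number of days since January 1, 1960, the same origin SAS uses, so dates moved from SAS keep their numeric values. The sketch below builds a date from made-up month, day and year values and derives a fiscal year; the October 1 fiscal-year start and the variable names are assumptions for illustration.

```stata
clear
input m d y
10 1 2007
end

* mdy() converts month, day, year into a Stata date (days since 01jan1960)
gen admitday = mdy(m, d, y)
format admitday %td

* Assumption for illustration: the fiscal year starts on October 1,
* so October through December belong to the next fiscal year
gen fy = year(admitday) + (month(admitday) >= 10)
list
```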

Combining Data

Merge
– this automatically creates a variable called _merge
– _merge==1 obs. from master data
– _merge==2 obs. from only one using dataset
– _merge==3 obs. from at least two datasets, master or using

merge scrssn admitday disday using data_y

Append (stacking data)
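A minimal sketch of a match merge on made-up data, using the same older merge syntax as the slide. With this syntax both files must be sorted by the key variable first. The variable names id, x and y and the temporary file names are hypothetical. From Stata 11 onward the documented form is merge 1:1 id using ..., which does not need the sort.

```stata
* Master file: ids 1 and 2
clear
input id x
1 10
2 20
end
sort id
tempfile a
save `a'

* Using file: ids 2 and 3
clear
input id y
2 200
3 300
end
sort id
tempfile b
save `b'

* Match merge on id, then inspect where each record came from
use `a', clear
merge id using `b'
tab _merge
list
```

Here id 1 is found only in the master, id 3 only in the using file, and id 2 in both.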

Explicit Subscripting

Identify the most recent encounter in an encounter database
gsort id -date (ascending sort by ID and reverse by date)
by id : gen n=_n (record counter from 1 to N per person)
by id : gen N=_N (total number of records per person)
gen select=n==1 (flags the most recent encounter)
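To see what _n and _N produce, the same commands can be run on a tiny made-up encounter file. The dates here are plain numbers chosen for illustration.

```stata
clear
input id date
1 100
1 250
2 300
end

* Newest encounter first within each person
gsort id -date
by id: gen n = _n
by id: gen N = _N

* The first record per person is now the most recent encounter
gen select = n==1
list
```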

Using Stata

Set, Clear and More

Set: sets system parameters
– Need to set memory size to open a database
set mem 100m
Clear erases data from memory
When output is >1 page, you are asked to continue (set more off turns this off)

Summarizing Data

. sum gender age educ

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      gender |      4085    1.496206    .5000468          1          2
         age |      4085     64.5601    9.451724         50         94
        educ |      4085    4.398286    1.662883          1          9

sum <varname>, d provides more details on each variable
tabstat provides summary info, including totals

Tabulating Data

. tab gender

     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,058       50.38       50.38
          2 |      2,027       49.62      100.00
------------+-----------------------------------
      Total |      4,085      100.00

. table gender

----------------------
   gender |      Freq.
----------+-----------
        1 |      2,058
        2 |      2,027
----------------------

Tabulating Data

. tab gender age
too many values
r(134);

. tab age gender

           |        gender
       age |         1          2 |     Total
-----------+----------------------+----------
        50 |        49         69 |       118
        51 |        72         71 |       143
         …
        94 |         1          0 |         1
-----------+----------------------+----------
     Total |     2,058      2,027 |     4,085

Tabstat

. tabstat age, by(gender)

  gender |      mean
---------+----------
       1 |  64.77454
       2 |  64.34238
---------+----------
   Total |   64.5601
--------------------

. table gender, c(mean age)

----------------------
   gender |  mean(age)
----------+-----------
        1 |   64.77454
        2 |   64.34238
----------------------

Graphing

Diagnostic graphics
Presenting results

[Figure: density plots of wtp, one panel per stage (stage 1 to stage 5)]

Basic Analytical Functions

OLS (reg)
Logistic, probit, count data (e.g., CLAD)
Multinomials
GLM/HLM
Duration models
Semi and non-parametric models

Creating Publishable Tables

outreg command
Outputs regression results to a delimited file
Delimited file can be read into Excel
Very flexible
Creates publishable tables easily

Example with VA data

Becaplermin

June 2006, FDA issued a Boxed Warning for becaplermin (a treatment for lower extremity diabetic ulcers)
Warning raised potential risk of cancer-related mortality

Analytical Goal

Case-control study for becaplermin
Sample is all patients with a diabetic ulcer of the lower extremity
Exposure is quantity of becaplermin prescriptions
Multivariate analysis, stratifying for patients with prior history of cancer

Pulling VA Data

VA utilization data extracts reside in SAS. I extracted my sample using SAS and then moved the data into Stata.
VA Data:
– Sample: All encounters with a diabetic ulcer principal diagnosis in NPCD and PTF (FY02-07)
– Exposure: All prescriptions from DSS pharmacy FY02-07 for Becaplermin feeder code
– Outcome: All encounters with a neoplasm principal diagnosis (FY97-07)

Transferring Data

Stattransfer or DBMS copy work
Stattransfer often seeks to optimize the Stata dataset by default
– If transferring data with SCRSSN, FORCE Stattransfer to transfer SCRSSN as double precision
– http://www.stata.com/support/faqs/data/prec.html
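The reason for forcing double precision can be seen directly in Stata: a float holds only about seven digits accurately, so a long numeric identifier stored as float is silently rounded. The sketch below uses a made-up nine-digit number, assuming SCRSSN is an identifier of similar length.

```stata
clear
set obs 1

* 123456789 is a made-up nine-digit ID, not a real SCRSSN
gen double id_double = 123456789
gen float  id_float  = 123456789

* Show all digits: the float copy no longer matches the original value
format id_double id_float %12.0f
list
```

A rounded identifier can collide with another person's ID or fail to match in a merge, so check the storage type with describe after every transfer.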

Stattransfer

CLICK ON DOUBLE

Diabetic Ulcer Sample

Goal: turn encounter level data into person level data

cd R:\twagner\customer\becap
use ulcer, clear
sort scrssn
by scrssn: gen n=_n
tab n
keep if n==1
keep scrssn
sort scrssn
gen ulcer=1
save finder, replace

Alternative Code

sort scrssn vizday (sort by visit date within person so the gap between consecutive visits is meaningful)
by scrssn: gen n=_n
by scrssn: gen num_ulcervisits=_N
by scrssn: gen newepisode=vizday[_n]-vizday[_n-1]>60
recode newepisode .=1 if n==1
by scrssn: egen episodes=total(newepisode)
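A minimal sketch of the 60-day episode rule on made-up data, so the result can be checked by eye. The IDs and visit days are invented, and the first visit is flagged explicitly rather than relying on how Stata compares a missing gap with 60.

```stata
clear
input scrssn vizday
1 0
1 10
1 100
2 50
end

sort scrssn vizday
by scrssn: gen n = _n

* A visit starts a new episode if it is the person's first visit
* or falls more than 60 days after the previous visit
by scrssn: gen newepisode = (n==1) | (vizday - vizday[_n-1] > 60)

* Number of episodes per person, repeated on each of their records
by scrssn: egen episodes = total(newepisode)
list
```

Person 1 has visits on days 0, 10 and 100, which gives two episodes; person 2 has one.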

Step 2: Merge Ulcer Sample and Cancer Cases

use neo, clear
gen cancer=1
sort scrssn
merge scrssn using finder
drop if _m==1

Merge command creates a new variable (_m abbreviates _merge):
_m=1 data only in master data
_m=2 data only in using data
_m=3 data merged in both

Step 2: continued

sort scrssn admitday disday sta3n adtime
by scrssn: egen firstcancer=min(admitday) if cancer==1
gen diedihcan=disto==-2 & cancer==1
gen dod_can=disday if diedihcan==1
gen cancerstays=1 if cancer==1
recode cancerstays .=0
drop _m
collapse (min) firstcancer (sum) cancerstays (max) diedihcan dod_can cancer, by(scrssn)
sort scrssn
save diabcancer, replace

Merge in Exposure data

use becap, clear
gen numrx=1
sort scrssn svc_dte
by scrssn: egen firstbecap=min(svc_dte)
by scrssn: egen lastbecap=max(svc_dte)
collapse (min) firstbecap (max) lastbecap (sum) day_supply numrx, by(scrssn)
gen becap=1
sort scrssn
save becapsum, replace

use diabcancer
merge scrssn using becapsum