Automating Your Work: An Introduction to Programming in Stata

Shawna N. Smith, 29 July 2009


…but why?

• GSS Mental Health Replication Study
• Respondents received one of four different vignettes: depression, schizophrenia, alcohol abuse, normal troubles
• 38 outcomes [binary]
• Two waves of data: 1996 & 2006
• First question: Is there a survey year difference?
• 4 vignettes x 38 outcomes = 152 potential differences

Roadmap

• Writing effective do-files [Review]
• Automation
  – Macros
  – Using stored info
  – foreach and forvalues loops
  – Ado-files {brief preview}

The Workflow of Data Analysis: Principles and Practices

By J. Scott Long
• Much of this talk is from Chapter 4: Automating your work
• For example files: type findit workflow and follow the instructions


[aside] Writing effective do-files

• Robust: To be robust, a do-file must produce exactly the same result when run at a later time or on another computer

• Legible: To be legible, a do-file must be documented and formatted so that it is easy to understand what is being done

Robust

• Self-contained
• Include version control
• Exclude directory information
  – Never hardcode your directory! Rather, set your working directory before you start your work
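
The three points above can be sketched as a do-file skeleton. This is only an illustration: the log name wf-skeleton is hypothetical, and the auto dataset that ships with Stata stands in for real data.

```stata
* wf-skeleton.do (hypothetical name): a self-contained do-file template
capture log close                     // close any log left open by earlier work
log using wf-skeleton, replace text   // written to the current working directory
version 10                            // version control: run under Stata 10 rules
clear all                             // self-contained: start from a clean slate
set more off

* No directory information: files are found relative to the working directory
sysuse auto                           // example dataset shipped with Stata
summarize price mpg

log close
```

Because the file never names a directory, it should run unchanged on another computer once the working directory has been set there.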

Legible

• Use comments
• Use alignment and indentation
• Use short lines [<80 characters]
• Limit the use of abbreviations

Automating Your Work

• Macros
• Saved results
• Loops
• Ado-files {brief preview}

Macros

• A macro assigns a string of text or a number to an abbreviation
• Two types of macros: {local} & {global}
• {Global}
  – Persists until you delete it or exit Stata
  – Can lead to do-files that unintentionally depend on a global macro created by another do-file
  – Such do-files are not robust and can lead to unpredictable results
• {Local}
  – Can only be used within the do-file or ado-file in which it is defined
  – When that program ends, the local macro disappears
• Macros are the simplest tool for automating your work
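
A minimal sketch of the difference in scope, using a small program as a stand-in for "another do-file". The program name setmacros and both macro names are made up for the illustration.

```stata
capture program drop setmacros
program define setmacros
    local  rhs_local  "var1 var2"     // visible only inside this program
    global rhs_global "var1 var2"     // persists after the program ends
end

setmacros
display "local:  [`rhs_local']"       // the local is gone, so this is empty
display "global: [$rhs_global]"       // the global is still defined
macro drop rhs_global                 // clean up so later code cannot depend on it
```

The global surviving the call is exactly what makes a later do-file depend on it by accident.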

Syntax

• local local-name "string"
  – local rhs "var1 var2 var3"
  – display "The local rhs contains: `rhs'"
• local local-name = expression
  – local ncases = 198
  – display "The local ncases equals: `ncases'"
• With the equals sign, expression is limited to 80 characters; without, "string" is limited to 67,784 characters. It is usually better to use "string"
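
A short sketch of how the two forms differ in what they store (the macro names a and b are arbitrary): the string form copies text, while the expression form evaluates first.

```stata
local a "2 + 2"       // string form: stores the text 2 + 2
local b = 2 + 2       // expression form: evaluates, then stores 4
display "a contains: `a'"
display "b contains: `b'"
display `a'           // the text is substituted, then display evaluates it
```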

Here is a simple example. I want to estimate the model:

. logit y var1 var2 var3

I can create the macro rhs with the names of the independent or right-hand-side variables:

. local rhs "var1 var2 var3"

Then, I can write the logit command as:

. logit y `rhs'

where the ` and ' indicate that I want to insert the contents of the macro rhs. That is, the command

. logit y `rhs'

works exactly the same as

. logit y var1 var2 var3

Macros can be combined to specify a sequence of nested models. First, I create macros for four groups of independent variables:

. local set1_age "age agesquared"

. local set2_educ "wc hc"

. local set3_kids "k5 k618"

. local set4_money "lwg inc"

Next, I specify four nested models. The first model includes only the first set of variables and is specified as:

. local model_1 "`set1_age'"

The macro model_2 combines the content of the local model_1 with the variables in local set2_educ:

. local model_2 "`model_1' `set2_educ'"

The next two models are specified the same way:

. local model_3 "`model_2' `set3_kids'"

. local model_4 "`model_3' `set4_money'"

Next, I check the variables in each model:

. display "model_1: `model_1'"
model_1: age agesquared

. display "model_2: `model_2'"
model_2: age agesquared wc hc

. display "model_3: `model_3'"
model_3: age agesquared wc hc k5 k618

. display "model_4: `model_4'"
model_4: age agesquared wc hc k5 k618 lwg inc

Using these locals, I estimate a series of logits:

. logit lfp `model_1'

. logit lfp `model_2'

. logit lfp `model_3'

. logit lfp `model_4'

The whole thing:

. local set1_age "age agesquared"
. local set2_educ "wc hc"
. local set3_kids "k5 k618"
. local set4_money "lwg inc"

. local model_1 "`set1_age'"
. local model_2 "`model_1' `set2_educ'"
. local model_3 "`model_2' `set3_kids'"
. local model_4 "`model_3' `set4_money'"

. display "model_1: `model_1'"
model_1: age agesquared
. display "model_2: `model_2'"
model_2: age agesquared wc hc
. display "model_3: `model_3'"
model_3: age agesquared wc hc k5 k618
. display "model_4: `model_4'"
model_4: age agesquared wc hc k5 k618 lwg inc

. logit lfp `model_1'
. logit lfp `model_2'
. logit lfp `model_3'
. logit lfp `model_4'


Saved results

• Stata commands send results to your log file but also save those results to memory
• Drukker’s Dictum: Never type anything that you can obtain from a saved result
• This information can be moved into macros and matrices, and used in many ways

Consider a simple example using -prvalue-.

Use -prvalue- to calculate the discrete change for a one-standard-deviation change (ΔSD) in age, centered on the mean. [The old way…]

. sum age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       753    42.53785    8.072574         30         60

. di 42.53785 + (8.072574/2)
46.574137

. di 42.53785 - (8.072574/2)
38.501563

. qui prvalue, x(age=38.501563) rest(mean) save label(SD-)

. prvalue, x(age=46.574137) rest(mean) dif label(SD+)

:::

A simpler [& more robust] way:

. local c "age"

. sum `c'

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         age |       753    42.53785    8.072574         30         60

. return list

scalars:
                  r(N) =  753
              r(sum_w) =  753
               r(mean) =  42.53784860557769    // scalar for mean of age
                r(Var) =  65.16645121641095
                 r(sd) =  8.072574014303674    // scalar for sd of age
                r(min) =  30
                r(max) =  60
                r(sum) =  32031

. local sdup = r(mean) + (r(sd)/2)

. local sddn = r(mean) - (r(sd)/2)

. qui prvalue, x(`c'=`sddn') rest(mean) save label(SD-)

. prvalue, x(`c'=`sdup') rest(mean) dif label(SD+)

:::
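
Since -prvalue- is a user-written command that may not be installed, here is a sketch of the same idea using only built-in commands. It assumes the auto dataset that ships with Stata, with mpg standing in for age.

```stata
sysuse auto, clear
local c "mpg"                          // the only line to edit for another variable
quietly summarize `c'
local sddn = r(mean) - (r(sd)/2)       // half a standard deviation below the mean
local sdup = r(mean) + (r(sd)/2)       // half a standard deviation above the mean
display "`c': from `sddn' to `sdup'"
```

No number is typed by hand, so changing the variable means changing a single local.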

Question: I discover a problem with my age variable & decide to change my local c to income. Which parts of the above code do I need to change if:

[1] I 'hardcoded' my numbers; & [2] I used the locals & scalars?


foreach and forvalues loops

• Loops let you execute a group of commands multiple times
• By combining macros with loops, you can speed up tasks ranging from creating variables to estimating models
• Loops can be used in many ways that make your workflow faster and more accurate. For example:
  – Creating interaction variables
  – Using the same command for multiple variables
  – Using information returned by Stata for other purposes

Syntax: foreach

• foreach local-name in | of list-type list {
      commands referring to `local-name'
  }
  – foreach name in var1 var2 var3 {
  – foreach var of varlist var1-var10 {
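
A small sketch of both forms, assuming the auto dataset that ships with Stata. With in, the list is taken literally; with of varlist, Stata expands the range and checks that the variables exist.

```stata
sysuse auto, clear

* "in": items are used exactly as typed
foreach name in price mpg weight {
    display "item: `name'"
}

* "of varlist": price-rep78 expands to the variables stored in that range
foreach var of varlist price-rep78 {
    quietly summarize `var'
    display "`var'" _col(12) "mean = " %9.2f r(mean)
}
```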

Syntax: forvalues

• forvalues lname = range {
      commands referring to `lname'
  }
  – forvalues nage = 40(5)80 {
  – forvalues n = 1(.1)100 {

Syntax         Meaning                              Example      Generates
#1(#d)#2       From #1 to #2 in steps of #d         1(2)10       1, 3, 5, 7, 9
#1/#2          From #1 to #2 in steps of 1          1/10         1, 2, 3, ..., 10
#1 #t to #2    From #1 to #2 in steps of (#t-#1)    1 4 to 15    1, 4, 7, 10, 13
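
To see what a range generates, a loop can simply collect its values. This sketch uses two of the ranges from the table; the macro names odds and steps are arbitrary.

```stata
local odds ""
forvalues i = 1(2)10 {
    local odds "`odds' `i'"          // append the current value to the list
}
display "1(2)10 generates:`odds'"

local steps ""
forvalues i = 1 4 to 15 {
    local steps "`steps' `i'"
}
display "1 4 to 15 generates:`steps'"
```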


Here is a simple example that illustrates the key features of loops.

I have a four-category ordinal variable y with values from 1 to 4. I want to create the binary variables y_lt2, y_lt3, and y_lt4 that equal 1 if y is less than the indicated value, else 0.

I can create the variables with three generate commands:

. generate y_lt2 = y<2 if y<.

. generate y_lt3 = y<3 if y<.

. generate y_lt4 = y<4 if y<.

I can do the same thing with a foreach loop:

1> foreach cutpt in 2 3 4 {
2>     generate y_lt`cutpt' = y<`cutpt' if y<.
3> }

The first time through the loop, the local cutpt is assigned the first value in the list.

Next, the generate command is run, where `cutpt' is replaced by the value assigned to cutpt. The first time through the loop, line 2 is evaluated as:

. generate y_lt2 = y<2 if y<.

Next, the closing brace } is encountered, which sends us back to the foreach command in line 1.

In the second pass, foreach assigns cutpt to the second value in the list, which means that the generate command is evaluated as:

. generate y_lt3 = y<3 if y<.

This continues once more, assigning cutpt to 4. When the foreach loop ends, three variables have been generated.
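
Because the cutpoints are consecutive integers, the same loop could also be written with forvalues. The sketch below builds a toy variable y so that it runs on its own; the data are made up for the illustration.

```stata
clear
set obs 8
generate y = mod(_n, 4) + 1            // toy ordinal variable with values 1 to 4

forvalues cutpt = 2/4 {
    generate y_lt`cutpt' = y<`cutpt' if y<.
}
tabulate y y_lt3                       // check one of the new variables
```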


Suppose that I need variables that are interactions between the binary variable male and a set of independent variables.

I can do this quickly with a loop:

foreach varname of varlist yr89 white age ed prst {
    generate maleX`varname' = male*`varname'
    label var maleX`varname' "male*`varname'"
}

To examine the new variables and their labels, I use codebook:

. codebook maleX*, compact

Variable      Obs  Unique      Mean  Min  Max  Label
---------------------------------------------------------------------------
maleXyr89    2293       2  .1766245    0    1  male*yr89
maleXwhite   2293       2  .4147405    0    1  male*white
maleXage     2293      71  20.50807    0   89  male*age
maleXed      2293      21  5.735717    0   20  male*ed
maleXprst    2293      59  18.76625    0   82  male*prst
---------------------------------------------------------------------------

How can we use what we learned about extended macros to improve upon this?
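
One possible answer, assuming the question refers to extended macro functions such as : variable label: retrieve each variable's label inside the loop and reuse it in the label of the interaction. The sketch uses the auto dataset that ships with Stata, with foreign standing in for male.

```stata
sysuse auto, clear
foreach varname of varlist mpg weight {
    local varlabel : variable label `varname'     // extended macro function
    generate forX`varname' = foreign*`varname'
    label var forX`varname' "foreign*`varlabel'"  // label text, not just the name
}
codebook forX*, compact
```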


Suppose I want to estimate the discrete change for a one-standard-deviation change (ΔSD), using -prvalue, save- & -prvalue, dif-, for multiple continuous variables.

Earlier, we used the following commands:

. local c "age"

. sum `c'

. local sdup = r(mean) + (r(sd)/2)

. local sddn = r(mean) - (r(sd)/2)

. qui prvalue, x(`c'=`sddn') rest(mean) save label(SD-)

. prvalue, x(`c'=`sdup') rest(mean) dif label(SD+)

To expand this to multiple continuous variables, we’ll use a -foreach- loop:

foreach var in age lwg {
    qui sum `var'
    local sdup = r(mean) + (r(sd)/2)
    local sddn = r(mean) - (r(sd)/2)
    di ""
    di "**Change in `var' from `sddn' to `sdup'"
    qui prvalue, x(`var'=`sddn') rest(mean) save label(SD-)
    prvalue, x(`var'=`sdup') rest(mean) dif label(SD+)
}

Output:

**Change in age from 38.50156159842585 to 46.57413561272952

logit: Change in Predictions for lfp

Confidence intervals by delta method

                        SD+      SD-
                    Current    Saved   Change    95% CI for Change
   Pr(y=inLF|x):     0.5150   0.6382  -0.1232   [-0.1717, -0.0747]
Pr(y=NotInLF|x):     0.4850   0.3618   0.1232   [ 0.0747,  0.1717]

                k5       k618        age         wc         hc        lwg        inc
Current=  .2377158  1.3532537  46.574136   .2815405  .39176627  1.0971148  20.128965
  Saved=  .2377158  1.3532537  38.501562   .2815405  .39176627  1.0971148  20.128965
   Diff=         0          0   8.072574          0          0          0          0

**Change in lwg from .8033366225286708 to 1.390893047643295

logit: Change in Predictions for lfp

Confidence intervals by delta method

                        SD+      SD-
                    Current    Saved   Change    95% CI for Change
   Pr(y=inLF|x):     0.6204   0.5340   0.0865   [ 0.0445,  0.1285]
Pr(y=NotInLF|x):     0.3796   0.4660  -0.0865   [-0.1285, -0.0445]

                k5       k618        age         wc         hc        lwg        inc
Current=  .2377158  1.3532537  42.537849   .2815405  .39176627   1.390893  20.128965
  Saved=  .2377158  1.3532537  42.537849   .2815405  .39176627  .80333662  20.128965
   Diff=         0          0          0          0          0  .58755643          0

Question: If I wanted to additionally compute the discrete change for a ΔSD in income, what would I need to change?

foreach v in age lwg {
    qui sum `v'
    local sdup = r(mean) + (r(sd)/2)
    local sddn = r(mean) - (r(sd)/2)
    di ""
    di "**Change in `v' from `sddn' to `sdup'"
    qui prvalue, x(`v'=`sddn') rest(mean) save label(SD-)
    prvalue, x(`v'=`sdup') rest(mean) dif label(SD+)
}


As mentioned earlier, when we run a command in Stata, it stores its results in memory. We can access them there & use them in our programs. This includes not only scalars [as seen with -sum-, earlier] but also matrices:

. qui logit lfp k5 k618 age wc hc lwg inc

. ereturn list

scalars:
                  e(N) =  753
[:::]
macros:
              e(title) : "Logistic regression"
[:::]
matrices:
                  e(b) :  1 x 8
                  e(V) :  8 x 8
              e(rules) :  1 x 4

. mat list e(b)

e(b)[1,8]
            k5        k618         age          wc          hc         lwg  [:::]
y1   -1.462913  -.06457068  -.06287055   .80727378   .11173357   .60469312
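
A sketch of pulling estimation results out of e(), using a built-in command and the auto dataset that ships with Stata rather than the lfp data shown above. The matrix name b is arbitrary.

```stata
sysuse auto, clear
quietly regress price mpg weight

matrix b = e(b)                            // copy the coefficient vector
matrix list b
display "N = " e(N)
display "coefficient on mpg   = " b[1,1]   // first column of the row vector
display "same value via _b[] = " _b[mpg]
```

Copying e(b) into a named matrix lets its elements be addressed by row and column.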

Many commands create matrices that we can use, e.g., to build cumulative matrices of results.

For example, running -prvalue, save- & -prvalue, dif- generates the following matrices:

. prvalue, x(age=20) dif

[:::]

. matrix dir
    _PEtemp[3,7]
    pedifsep[2,1]
    pelower[7,2]      // Matrix for lower CI bound
    peupper[7,2]      // Matrix for upper CI bound
    pepred[7,2]       // Matrix that includes discrete change
    peinfo[3,12]
    pebase[3,7]
    PE_in[1,7]
    PE_base[1,7]
    PRVinfo[1,12]
    PRVlower[2,2]
    PRVupper[2,2]
    PRVmisc[1,2]
    PRVprob[1,2]
    PRVbase[1,7]
    _PRVsav[1,6]
    pegrad_pr[2,8]

. matrix list pepred

pepred[7,2]
                 c1          c2
1values           0           1
  2prob   .15049911   .84950089
  3misc   1.7306918           .
 saved=   .06454021   .93545979
 saved=   2.6737502           .
 saved=    .0859589   -.0859589    // Discrete change [6,2]
 saved=  -.94305837           .

We can make use of these stored matrices to generate our own matrix of discrete change coefficients & confidence intervals:

matrix dc = J(9,4,.)                  // create empty matrix with 9 rows & 4 columns
matrix colnames dc = x dc dcLB dcUB   // label columns
local irow1 = 0                       // initialize a counter that will indicate row where I want to put info

forvalues n = 30(5)70 {
    local ++irow1                     // this adds 1 to the counter
    prvalue , x(wc=1 age=`n') save rest(mean) lab(WC)
    prvalue , x(wc=0 age=`n') diff rest(mean) lab(noWC)
    matrix dc[`irow1',1] = `n'
    matrix dc[`irow1',2] = pepred[6,2]
    matrix dc[`irow1',3] = pelower[6,2]
    matrix dc[`irow1',4] = peupper[6,2]
    mat list dc
}

Final output:

dc[9,4]
      x         dc       dcLB       dcUB
r1   30  .13744253  .06500561  .20987945
r2   35  .16046226  .07829062   .2426339
r3   40  .18021131  .08726909  .27315353
r4   45  .19384143  .09151363  .29616923
r5   50  .19910133  .09109842  .30710424
r6   55  .19506124  .08578821  .30433427
r7   60   .1824384  .07516111   .2897157
r8   65  .16334447  .05965045  .26703849
r9   70  .14059515  .04131751   .2398728

And for my final trick:

// change matrix to variables
svmat dc , names(col)
label var x "value of x"
label var dc "discrete change"
label var dcLB "95% CI"
label var dcUB "95% CI"

twoway ///
    (connected dcLB x, msymbol(i) clpat(dash) clwidth(medthick) clcolor(blue)) ///
    (connected dc x, msymbol(i) clpat(solid) clwidth(medthick)) ///
    (connected dcUB x, msymbol(i) clpat(dash) clwidth(medthick) clcolor(blue)) ///
    , ytitle("Pr(Wife no college) - Pr(Wife college)") ylabel(0(.2)1) ///
    xtitle(age) xlabel(30(5)70) ///
    legend(pos(11) order(2 1) ring(0) cols(1) region(ls(none))) ///
    title("Labor Force Participation by" "Wife's College Attendance")

Ado-files

• Ado-files are like do-files, except that they are automatically run
• Indeed, ado stands for automatically loaded do-file
• Stata 10 has nearly 2,000 ado-files
• When you run a command, you cannot tell whether it is part of the executable or is an ado-file
• This means that Stata users like you can write new commands and use them just like official Stata commands

Ado-files: An Example

• List variable names and labels
• nmlabel.ado

My first version of nmlabel lists the names and labels with no options. It looks like this:

*! version 1.0.0 \ trm 2008-03-29
program define nmlabelV1
    version 10
    syntax varlist
    foreach varname in `varlist' {
        local varlabel : variable label `varname'
        display in yellow "`varname'" _col(10) "`varlabel'"
    }
end

Here is how the command works:

. nmlabelV1 lfp-inc
lfp      Paid Labor Force: 1=yes 0=no
k5       # kids < 6
k618     # kids 6-18
age      Wife's age in years
wc       Wife College: 1=yes 0=no
hc       Husband College: 1=yes 0=no
lwg      Log of wife's estimated wages
inc      Family income excluding wife's

The new version of the program looks like this:

 1> *! version 2.0.0 \ trm 2008-03-29
 2> program define nmlabelV2
 3>     version 10
 4>     syntax varlist [, skip]
 5>     if "`skip'"=="skip" {
 6>         display
 7>     }
 8>     foreach varname in `varlist' {
 9>         local varlabel : variable label `varname'
10>         display in yellow "`varname'" _col(10) "`varlabel'"
11>     }
12> end

If I enter the command with the skip option, the syntax command in line 4 creates a local named skip that contains the string skip:

local skip "skip"

If I do not specify the skip option, syntax creates the local skip as a null string:

local skip ""

The third version looks like this:

*! version 3.0.0 \ trm 2008-03-29
program define nmlabelV3
    version 10
    syntax varlist [, skip NUMber ]
    if "`skip'"=="skip" {
        display
    }
    local varnumber = 0
    foreach varname in `varlist' {
        local ++varnumber
        local varlabel : variable label `varname'
        if "`number'"=="" { // do not number lines
            display in yellow "`varname'" _col(10) "`varlabel'"
        }
        else { // number lines
            display in green "#`varnumber': " ///
                in yellow "`varname'" _col(13) "`varlabel'"
        }
    }
end

Here is the new ado-file:

*! version 4.0.0 \ trm 2008-03-29
program define nmlabelV4
    version 10
    syntax varlist [, skip NUMber COLnum(integer 16)]
    if "`skip'"=="skip" {
        display
    }
    local varnumber = 0
    foreach varname in `varlist' {
        local ++varnumber
        local varlabel : variable label `varname'
        if "`number'"=="" { // do not number lines
            display in yellow "`varname'" ///
                _col(`colnum') "`varlabel'"
        }
        else { // number lines
            display in green "#`varnumber': " ///
                in yellow _col(6) "`varname'" ///
                _col(`colnum') "`varlabel'"
        }
    }
end
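
To see what syntax does with the two kinds of options used here, a stripped-down program can echo the locals it creates. The program name showopts is made up for the illustration, and it returns its values in r() only so that they can be inspected afterward.

```stata
capture program drop showopts
program define showopts, rclass
    version 10
    syntax [, skip COLnum(integer 16)]
    * skip is "skip" or empty; colnum is the supplied integer or the default 16
    display "skip = [`skip']" _col(`colnum') "colnum = `colnum'"
    return local  skip   "`skip'"
    return scalar colnum = `colnum'
end

showopts                     // neither option given: skip is empty, colnum is 16
showopts, skip col(20)       // col() is the minimal abbreviation of COLnum()
```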


Extra slides

Counters are so useful that Stata has a simpler way to increment them. The command local ++counter is equivalent to local counter = `counter' + 1. So instead of this:

local counter = 0
foreach varname of varlist warm yr89 male white age ed prst {
    local counter = `counter' + 1
    local varlabel : variable label `varname'
    display "`counter'. `varname'" _col(12) "`varlabel'"
}

We can use this:

local counter = 0
foreach varname of varlist warm yr89 male white age ed prst {
    local ++counter
    local varlabel : variable label `varname'
    display "`counter'. `varname'" _col(12) "`varlabel'"
}

Next, I use a matrix command to create a matrix named stats:

. matrix stats = J(`nvars',2,.)

The J function creates a matrix based on three arguments. The first is the number of rows, the second the number of columns, and the third is the value used to fill the matrix. In this case, I want the matrix to be initialized with missing values, which are indicated by a period.

The matrix looks like this:

. matrix list stats

stats[6,2]
    c1  c2
r1   .   .
r2   .   .
r3   .   .
r4   .   .
r5   .   .
r6   .   .
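
A sketch of how such a matrix might then be filled, combining a counter, a loop, and saved results. It assumes the auto dataset that ships with Stata, and the variable list is chosen only for illustration.

```stata
sysuse auto, clear
local vars "price mpg weight"
local nvars : word count `vars'          // number of rows needed

matrix stats = J(`nvars', 2, .)          // start with missing values
matrix colnames stats = mean sd
matrix rownames stats = `vars'

local irow = 0
foreach v of local vars {
    local ++irow
    quietly summarize `v'
    matrix stats[`irow', 1] = r(mean)    // fill from saved results
    matrix stats[`irow', 2] = r(sd)
}
matrix list stats
```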

Nested Loops

You can nest loops by placing one loop inside of another loop.

Consider the earlier example of creating binary variables indicating if y was less than a given value:

foreach cutpt in 2 3 4 {
    generate y_lt`cutpt' = y<`cutpt' if y<.
}

Suppose that I need to do this for variables ya, yb, yc, and yd.

foreach y of varlist ya yb yc yd { // loop 1 begins
    foreach cutpt in 2 3 4 { // loop 2 begins
        * create binary variable
        generate `y'_lt`cutpt' = `y'<`cutpt' if `y'<.
    } // loop 2 ends
} // loop 1 ends

What is the first variable created? the last?
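
One way to check the answer without any data is to run the same nested loops but collect the names instead of generating variables. This is a sketch; the macro names created, first, and last are arbitrary.

```stata
local created ""
foreach y in ya yb yc yd {               // outer loop: one pass per variable
    foreach cutpt in 2 3 4 {             // inner loop runs fully on each pass
        local created "`created' `y'_lt`cutpt'"
    }
}
local n     : word count `created'
local first : word 1 of `created'
local last  : word `n' of `created'
display "`n' variables; first = `first', last = `last'"
```

The inner loop finishes all of its cutpoints before the outer loop moves on to the next variable.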