Intermediate STATA


Adrián de la Garza, Jeremy Green

27 March 2009


Getting Help

STATA Help: Just type help in the STATA main Command window.

STATA listserv: http://www.stata.com/statalist/

UCLA Stat Computing: http://www.ats.ucla.edu/stat/stata/

Yale StatLab Consultants, online help and FAQs: http://statlab.stat.yale.edu/help/

Manuals also available at SSL and Yale StatLab.

0. Introduction


Today’s Workshop

1. Programming/Project Management Tips
2. Data Management
3. Analyzing Data
   - Graphs
   - Statistical Analysis

Latest version: STATA v. 10. Commands throughout this presentation always refer to this version, although most are backwards-compatible.


Using DO files (1/2)

DO files allow you to run a whole program interactively; you can run it all at once or select portions of the program.

AVOID making changes to your original data interactively using the STATA command window. Use DO files instead.

Use DO files to make changes to your data and to run your statistical and graphical analyses. Keep track of your progress.

1. Programming/Project Management Tips


Using DO files (2/2)

Keep your DO files organized: it helps to create a main DO file from which you run other DO files that perform smaller tasks on your data.

Write lots of comments in your DO file to help you remember what a command or a section of your DO file does. This will help you remember what you did months ago.

To open a DO file, use the FILE menu or the DO-file button.
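One possible layout for the main DO file described above is sketched here. The three file names are placeholders invented for illustration; they are not files from the workshop.

```stata
* master.do: hypothetical main do-file
* (the file names below are illustrative placeholders)
version 10                 // interpret commands as Stata 10 would
clear
set more off               // do not pause output while the script runs

do "01_import.do"          // hypothetical: read and save the raw data
do "02_clean.do"           // hypothetical: recode, label, create variables
do "03_analysis.do"        // hypothetical: tables, graphs, regressions
```

Running master.do then reproduces every step in order, which is the point of keeping the smaller tasks in separate files.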



Log files

Syntax

Open log file:
log using filename [, append replace [text|smcl] name(logname)]

Close log, temporarily suspend logging, or resume logging:
log {close|off|on} [logname]

Examples

. log using mylog
. log close
. log using mylog, append
. log close
. log using "filename containing spaces"
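Inside a DO file, a common pattern (a sketch, not part of the original slides) is to close any log that is still open before starting a new one, so the file can be rerun without errors. The log name mylog is taken from the examples above.

```stata
capture log close                 // close a log left open earlier; ignore the error if none is open
log using mylog, replace text     // overwrite the old log and write plain text instead of SMCL

sysuse census, clear
summarize pop

log close
```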


Managing Your Data

Back up all Master Data Files
• CD, USB drive, network

Keep a detailed codebook
• Describes each variable and values
• Adding variables, cases, computing new variables

Keep a roadmap
• Keep a log of all analyses with what you have done
• Save syntax files

2. Data Management

Inspecting Your Data (1/3)

cd "C:\Documents and Settings\Adrian\My Documents\stata files"
clear
set mem 80m
log using "C:\Documents and Settings\Adrian\My Documents\stata files\logs\mylog"
sysuse census
browse
list state region pop if _n <= 3      /* shows first 3 obs */
l state region pop if _N - _n <= 2    /* shows last 3 obs */
l state region pop in 1/3             /* shows first 3 obs */
l state region pop in -3/l            /* shows last 3 obs */


Inspecting Your Data (2/3)

generate agesq = medage^2     /* creates variable equal to medage squared */
sum pop                       /* shows summary stats for pop */
scalar popmean = r(mean)      /* saves mean of pop to scalar popmean */

/* create variable equal to 1 when pop > popmean and 0 otherwise */
g dummy = 0
replace dummy = 1 if pop > popmean

/* how many states have population higher than average? */
count if dummy == 1

/* how many states NOT IN THE SOUTH have pop > popmean? */
count if dummy == 1 & region != 3
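The same indicator can be built in one line. The sketch below is an alternative, not the slide's method; the variable name bigpop is made up. Because Stata treats a missing value as larger than any number, the if !missing(pop) qualifier keeps missing populations from being coded as 1.

```stata
sysuse census, clear
quietly summarize pop
generate byte bigpop = (pop > r(mean)) if !missing(pop)   // 1 or 0; missing stays missing
count if bigpop == 1
count if bigpop == 1 & region != 3    // region 3 is the South in this dataset
```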


Inspecting Your Data (3/3)

describe
label list            /* shows all labels attached to dataset */
label list cenreg     /* shows label cenreg attached to variable region */
sum pop
browse

/* summarize population by region */
sum pop if region == "NE"     /* this gives an error since region is not a string */
sum pop if region == 1        /* this does work */
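If selecting by the label text is more convenient, one option (a sketch; the variable name regname is invented here) is to decode the labeled numeric variable into a string first:

```stata
sysuse census, clear
decode region, generate(regname)     // regname is a new string variable holding the label text
tabulate regname
summarize pop if regname == "NE"     // string comparison now works
```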


Calculate mean population by region

Method 1

sum pop if region == 1

Downside:

We have to type the sum command for each individual region. If the dataset contained population data by city and we had to compute means for each of the 50 states, typing the sum command 50 times would be very painful!



Method 2

bysort region: sum pop

Downside:

This method shows the population means by region, like we wanted, but it also shows a bunch of other stats we may not care about. Also, the means are stored in memory but are not readily available for use in case we want to use those means for further calculations.



Method 3

table region, c(m pop)

Downside:

This method is great for presentation purposes: it shows exactly the information we want. One problem, however, is that the information is still not readily available for use in case we want to store the population means by region for further analyses.
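A closely related display command is tabstat, sketched below as an alternative rather than as the workshop's method. It may be useful because, to my knowledge, the table command was redesigned in releases well after version 10, so the c() option shown above may not be accepted there.

```stata
sysuse census, clear
tabstat pop, by(region) statistics(mean n)    // mean population and number of states per region
```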



Method 4

sysuse census, clear
collapse (mean) pop, by(region)

Downside:

The collapse command converts the dataset in memory into a set of means, standard deviations, and other summary stats. In our case, the new dataset now contains population means by region. All variables other than the collapsed variable (pop) and the grouping variable (region) are NOT collapsed and hence disappear from the dataset. Can we make any further analyses without the rest of the variables?
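One way to look at the collapsed means without losing the full dataset is to wrap collapse in preserve and restore. This is a sketch of that idea, not something shown on the slide; the name meanpop is chosen here for illustration.

```stata
sysuse census, clear
preserve                                     // take a snapshot of the full dataset
collapse (mean) meanpop = pop, by(region)    // data in memory: one row per region
list
restore                                      // the full census data are back in memory
describe, short
```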



Method 5

sysuse census, clear
by region, sort: egen meanpop = mean(pop)

Downside:

Do we really want an additional variable in the dataset that contains information on population means by region, a number that is repeated for each observation (state) within the same region? In very large datasets, one additional variable may lead to memory constraints. Use scalars?
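The scalar idea raised above could look like the following sketch. The scalar names rmean1 to rmean4 are invented for illustration; the loop assumes region takes a small number of integer values, as it does in the census data.

```stata
sysuse census, clear
levelsof region, local(regions)          // distinct values of region, stored in a local macro
foreach r of local regions {
    quietly summarize pop if region == `r'
    scalar rmean`r' = r(mean)            // one scalar per region: rmean1, rmean2, ...
}
scalar list                              // show the stored means
display rmean3 / rmean1                  // scalars can be used in later calculations
```

This stores one number per region instead of one per observation, at the cost of having to manage the scalar names yourself.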


Reshaping Data

sysuse bplong, clear
br

Suppose we want to take the difference in bp before and after treatment. It is difficult to calculate the difference if the data are organized in long format. We need to convert to wide format.

reshape wide bp, i(patient sex agegrp) j(when)
br
g bpdiff = bp2 - bp1
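A round trip from long to wide and back can be sketched as follows. Here i(patient) alone is used as the identifier, on the assumption that sex and agegrp do not vary within a patient. Note that the subtraction must use a plain hyphen; a typographic dash pasted from a slide will not be read as a minus sign.

```stata
sysuse bplong, clear
reshape wide bp, i(patient) j(when)     // bp1 = first measurement, bp2 = second
generate bpdiff = bp2 - bp1
summarize bpdiff
reshape long bp, i(patient) j(when)     // back to one row per patient and time point
```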


Value Labels (1/2)

g gender = sex
br

Why do gender and sex look different? Value labels

Why use value labels?
* They save space (e.g., “0” instead of “male” for each obs.)
* More informative to the researcher (e.g., what region is 3?)
* Regression, lists, tables… display labels instead of values

table sex, c(m bp1 m bp2)
table gender, c(m bp1 m bp2)


Value Labels (2/2)

label value gender sex     /* note that sex refers to label, not var */
br patient sex gender

label value gender         /* detaches sex label from gender variable */
br pat sex gend

label define genderlbl 0 "man" 1 "woman"
label value gender genderlbl
br pat sex gend

What do the following commands do?

label define genderlbl 2 "na", add
label define genderlbl 0 "Man" 1 "Woman" 2 "NA", modify
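One way to check the answer is to list the label after each step, as in this self-contained sketch. It starts from the long bplong data rather than the reshaped data used on the slide, which does not affect how the labels behave.

```stata
sysuse bplong, clear
generate gender = sex
label define genderlbl 0 "man" 1 "woman"
label values gender genderlbl

label define genderlbl 2 "na", add                         // adds a mapping for a new value; 0 and 1 are untouched
label list genderlbl

label define genderlbl 0 "Man" 1 "Woman" 2 "NA", modify    // overwrites the text of existing mappings
label list genderlbl
```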


Dummy Variables (1/3)

Suppose we want to create dummy vars for each of the 4 regions in the census database:

g dum1 = 0
replace dum1 = 1 if region == 1
…

What problems may these commands lead to?


Dummy Variables (2/3)

To create four dummies, we need to type those two commands four times.

More importantly, the previous method generates 0s even when we have missing values.

tab region, g(d)

This second method tabulates the variable region, showing a list of the four regions, and correctly creates 4 separate dummies, accounting for missing values.
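If only one dummy is needed, the missing-value problem can also be avoided by hand. The sketch below is an alternative to the slide's commands; it builds dum1 so that it stays missing wherever region is missing, then checks it against the dummy created by tabulate.

```stata
sysuse census, clear
generate byte dum1 = (region == 1) if !missing(region)   // 1 or 0; missing where region is missing

tabulate region, generate(d)     // creates d1, d2, d3, d4
assert dum1 == d1                // the two approaches agree on this dataset
```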


Dummy Variables (3/3)

One more command that will be useful in regressions:

xi i.region, noomit

This third alternative yields the same results as the tab method described in the previous slide.
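In a regression, xi is typically used as a prefix, as in this sketch. The choice of medage as the outcome is arbitrary and only for illustration.

```stata
sysuse census, clear
xi: regress medage i.region     // xi creates the region dummies and omits the lowest category as the base
* In Stata 11 or later, factor-variable notation does the same without xi:
*     regress medage i.region
```

Without noomit, one category is left out so that the dummies are not collinear with the constant.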


Merging Data (1/4)

sysuse census, clear
keep state-popurban
sort state                        /* both master and using data must be sorted */
save census1, replace

sysuse census, clear
keep state region medage-divorce  /* note region is kept in both */
sort state
save census2, replace

use census1, clear
merge state using census2         /* remember: both files must be sorted */
table _merge                      /* _merge keeps track of how good merge was */
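For readers on Stata 11 or later (an assumption; the slides target version 10), the merge command takes an explicit match type and no longer requires sorting first. A sketch of the same merge in that syntax:

```stata
* Requires Stata 11 or later
sysuse census, clear
keep state-popurban
save census1, replace

sysuse census, clear
keep state region medage-divorce
save census2, replace

use census1, clear
merge 1:1 state using census2     // one observation per state in each file
tabulate _merge
assert _merge == 3                // stop the do-file if any state failed to match
```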


Merging Data (2/4)

Important!!!

If a non-merging variable (e.g. region) is in both files, the data in the master file will be kept, while the data in the using file will be lost.

use census1, clear
l state region in 1
replace region = 2 in 1
sort state
merge state using census2
table _merge
l state region in 1       /* region data in master file is kept */
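When the using file should win instead, merge accepts the update and replace options. The sketch below rebuilds both files so it can be run on its own and deliberately changes one value in the master data to create a conflict.

```stata
sysuse census, clear
keep state region medage-divorce
sort state
save census2, replace

sysuse census, clear
keep state-popurban
sort state
replace region = 2 if state == "Alabama"      // make the master disagree with the using file
merge state using census2, update replace     // using values overwrite conflicting master values
tabulate _merge                               // with update, extra codes flag updated and conflicting obs
list state region if state == "Alabama"       // region now comes from the using file
```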


Merging Data (3/4)

Now suppose that each of the two databases contains information about only SOME (non-overlapping) of the 50 states. Do we lose information after merging the two datasets?

drop in 3/6

save, replace

drop in 22/23

merge state using census2

table _merge
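The commands above appear to omit the steps that load each file. The following sketch fills them in under the assumption that the first drop is applied to the using file (census2) and the second to the master file (census1); the original slide may have ordered the steps differently.

```stata
* Assumed reconstruction: which file each drop applies to is a guess
sysuse census, clear
keep state-popurban
sort state
save census1, replace

sysuse census, clear
keep state region medage-divorce
sort state
drop in 3/6                    // using file now lacks four states
save census2, replace

use census1, clear
drop in 22/23                  // master file now lacks two other states
merge state using census2
tabulate _merge                // 1 = master only, 2 = using only, 3 = matched
```

Unmatched states are kept in the merged data with missing values for the variables they lack, so no observations are lost.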


Merging Data (4/4)

Finally, it’s important to note that, in case a variable has value labels attached in both datasets, labels attached to variables in the master dataset prevail.

This may cause serious trouble, for example, when we are merging datasets from surveys taken in different years and for which the possible values in the answers may mean different things.

Example 1: Change in scale (1 to 4 in 1980; 1 to 5 in 1990).

Example 2: Omitted country in second survey, but all countries, sorted in alphabetical order, are assigned consecutive values.


Other Data Management Issues

Use StatTransfer software to convert Excel, SAS, SPSS, … into STATA.

Use the compress command to make your dataset as small as possible and use less memory.

Some very large datasets won’t open in STATA due to STATA’s memory limitations. In this case, it is recommended that you open a subset of the dataset, delete variables/observations that don’t interest you and try again:

use varlist using filename
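A small self-contained sketch of loading a subset follows. The file name census_full is invented for illustration, and the census data are of course small enough to load whole; the point is only the form of the command.

```stata
sysuse census, clear
save census_full, replace                                        // hypothetical file standing in for a large dataset

use state region pop if region == 4 using census_full, clear     // load three variables, one region only
compress                                                         // store each variable in the smallest type that fits
describe, short
```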


Analyzing Data: Make a List

Dependent Variable(s) (response, outcome, criterion)

Independent Variables (explanatory or predictor variables)
• Treatment Variable
• Covariates / Confounding Variables

Categorical and Continuous Variables
• Remember: Types of variables determine the statistics we use

Time period

Scope and type of analysis

3. Analyzing Data

Analyzing Data: Graphs (1/2)

Draw a histogram:
sysuse auto, clear
histogram price

Create a scatter plot:
scatter price mpg

Draw line of best fit (linear regression):
twoway lfit price mpg

Put two graphs together:
twoway scatter price mpg || lfit price mpg
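The combined graph can be given titles and saved from a DO file, as in this sketch. The title text, axis wording, and output file name are made up for illustration.

```stata
sysuse auto, clear
twoway (scatter price mpg) (lfit price mpg),              ///
    title("Price and mileage")                            ///
    ytitle("Price") xtitle("Mileage (mpg)")               ///
    legend(order(1 "Observed" 2 "Linear fit"))
graph export price_mpg.eps, replace                       // hypothetical output file name
```

The parenthesized form is equivalent to separating the two plots with ||.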


Analyzing Data: Graphs (2/2)

Type help graphs to:
* create other graphs (pie and bar charts, box plots, etc.);
* adjust graph settings (change labels, axes, colors…)

An easier (although less customizable) option is to use the GRAPH menu.


Analyzing Data: Statistical Analysis (1/2)

Correlation: quantify relationships between variables

Regression: predict dependent variable from independent variable(s)

Group differences
• t-test & ANOVA
• Chi-square for categorical and frequency data

Significance v. effect size

More Complex Models
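The tests listed above map onto standard Stata commands. The sketch below uses the auto dataset; the particular variables were picked only for illustration.

```stata
sysuse auto, clear
pwcorr price mpg weight, sig      // pairwise correlations with p-values
ttest mpg, by(foreign)            // two-group t-test: domestic vs. foreign cars
oneway price rep78, tabulate      // one-way ANOVA of price across repair-record categories
tabulate foreign rep78, chi2      // chi-square test of independence for two categorical variables
```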


Analyzing Data: Statistical Analysis (2/2)

cor var1 var2 gives the basic (Pearson) correlation between two variables.
cor price mpg

regress var1 var2 fits a linear regression of var1 on var2; the coefficient on var2 estimates the change in var1 associated with a one-unit change in var2.
reg price mpg
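After regress, the estimates stay available for further use. A short sketch follows; the new variable names price_hat and resid are chosen here for illustration.

```stata
sysuse auto, clear
regress price mpg
display _b[mpg]              // estimated slope on mpg
display e(r2)                // R-squared of the fitted model
predict price_hat            // fitted values (linear prediction)
predict resid, residuals     // residuals: price minus fitted value
```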

Useful textbook for more on stats for social sciences:
Agresti, Alan, and Barbara Finlay (2008): Statistical Methods for the Social Sciences, Prentice Hall, 4th edition.

Textbook examples with STATA: http://www.ats.ucla.edu/stat/examples/smss/


Thank you!!
