How to Use Stata 11.0 Revised

How to Use STATA 11.0

This version: 2011-03-22

Spring 2011

(subject to be updated)

Woochan Kim

KDI School of Public Policy and Management

Table of Contents

I: How to Enter Data into STATA

II: How to Describe & Modify Variables

III: How to Tabulate Variables

IV: How to Graph Variables

SCATTER PLOTS

HISTOGRAMS

BAR CHARTS

LINE CHARTS

HOW TO COPY AND PASTE GRAPHS

V: Hypotheses Testing

DISTRIBUTIONS

ONE SAMPLE T-TEST

TWO SAMPLE T-TEST

VI: Regression

CORRELATION

SIMPLE OLS

FITTED LINE IN SCATTER PLOT

MULTIVARIATE REGRESSION

CONTROL

COLLINEARITY

RIGHT HAND SIDE DUMMY VARIABLES

VII: More on Regressions

F-TESTS

REGRESSION DIAGNOSTICS

HETEROSCEDASTICITY

SERIAL-CORRELATION

VIII: Using Log Files and Do Files

LOG FILES

DO FILES

IX: Other Useful Commands

LARGE FILES

MERGE FILES

RESHAPE A FILE

I: How to Enter Data into STATA

Double-click the STATA 11.0 icon in Windows.

When the STATA program is open, choose your interface preference by going to “Edit” and then “Preferences.” I personally prefer the simple “Factory Settings.” You can also relocate the “command,” “review,” “variables,” and “results” windows to your liking and save your window preference. Under “General Preferences,” you can also change the colors and the font sizes.

Entering data into STATA

The easiest way to enter data into STATA is to copy and paste. If the data set is in Excel format, copy the area of interest, open the STATA editor (look for the icon under the menu bar), place the cursor on the cell in the upper left corner, and paste. You can try this by downloading the cash.xlsx data from the e-education site. Save the file in the c:\stata10 directory.

clear

Drops any data in memory

input id mpg weight price

“Input” allows one to enter data using the command window. The specific command above

allows one to input data with variables named “id” “mpg” “weight” “price”

Type in the following data for each observation

1 22 2930 4099

2 17 3350 4749

3 22 2640 3799

One must type in “end” to finish inputting
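Put together, the steps above form one block that can be pasted into the command window or a do-file. This is a minimal sketch using the three observations shown; the final “list” is added here only to check what was entered.

```stata
clear
input id mpg weight price
1 22 2930 4099
2 17 3350 4749
3 22 2640 3799
end
* show the three observations just entered
list
```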

save c:\stata10\auto1

Saves the data in a file titled “auto1” in directory “c:\stata10” (you can save the file in any other

directory of your choice). The data will be saved with the “dta” extension. You can also use

the pull-down menu. Go to “File” and then click “Save.”

edit

This command allows one to input data using the “Editor.” One can also simply click the “Edit” icon under the menu bar. Type in the following data for each observation

1 22 2930 4099

2 17 3350 4749

3 22 2640 3799

Click “preserve” located at the upper left corner

Click “restore” to undo changes

rename var1 id

Change the name of the variable from “var1” to “id.” You can also use the pull-down menu.

Go to “Data,” and then click “Variable Utilities.”

save c:\stata10\auto1, replace

Replaces the old data titled “auto1” with a new one also titled “auto1”

Save cash.xlsx as a comma-separated file with the raw extension (cash.raw)

If you have the cash.csv data instead, open it and save it as a comma-separated file with the raw extension.

insheet using c:\stata10\cash

This command imports the comma-separated cash.raw file into STATA (when no extension is given, insheet assumes .raw). You can also use the pull-down menu.

Go to “File,” and then “Import,” and then click “ASCII dataset created by a spreadsheet.”

save c:\stata10\cash

clear

use c:\stata10\cash

This command opens the STATA file. You can also use the pull-down menu. Go to “File” and

then click “Open.”

browse

This command opens the browser window and allows one to directly see the dataset. You can

also simply click the browse icon under the menu bar.

outsheet using c:\stata10\cash

Writes the data to an ASCII text file named c:\stata10\cash.out. You can also use the pull-down menu. Go to “File” and then “Export” and then click “ASCII Text.” If you wish to save it as comma-separated data, click “Comma or tab-separated data.”

exit

Exits from STATA

II: How to Describe & Modify Variables

clear

Download sample datasets from STATA

Go to “File” and then to “Example Datasets.” Click “Example datasets installed with Stata.”

Choose “auto.dta” and then click “use” to download.

save c:\stata10\auto

describe (or press F3)

Shows what is in the data c:\stata10\auto.dta. You can see the number of variables and observations. You can also see the name and description of each variable.

browse

This command opens the browser window and allows one to directly see the dataset. You can

also simply click the browse icon under the menu bar. Inside the browser window, you can

see the icons that let you “sort” observations within a given variable or hide/relocate specific

variables.

edit

This command opens the editor window and allows one to directly edit the dataset. Inside the editor window, you can see an icon that lets you delete a specific variable. Click “preserve” located at the upper left corner to keep the changes made and click “restore” to undo changes.

summarize rep78 (or sum rep78)

This command gives summary statistics of “rep78.” It shows the number of observations, the mean, the standard deviation, the minimum value, and the maximum value (rep78: repair record in 1978).

summarize rep78, detail

This command gives a more detailed description of the variable. It shows the label and the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentile values. It also gives the four smallest and four largest values, the skewness, and the kurtosis. Kurtosis measures the degree of peakedness of a distribution or the degree of fat tails of a distribution.
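After “summarize,” STATA keeps the displayed figures as returned results, which is handy when a later command needs the mean or a percentile. A short sketch using the auto data; it reloads the data with “sysuse” so that it runs on its own.

```stata
sysuse auto, clear
quietly summarize rep78, detail
* returned results left behind by summarize
display r(N)
display r(mean)
display r(p50)
display r(kurtosis)
* full list of what is available
return list
```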

gen mpgsq=mpg^2

This command generates a new variable named “mpgsq,” which is the squared value of mpg (mpg: mileage, in miles per gallon). You can also use the pull-down menu by going to “Data,” and then to “Create or change variables,” and then by clicking “Create new variable.”

rename mpgsq mpgs

This command changes the variable’s name from “mpgsq” to “mpgs.” You can also use the

pull-down menu by going to “Data,” and then to “Variable utilities,” and then clicking

“Rename variable.”

drop mpgs

This command drops the variable “mpgs” from the data set. You can also use the pull-down

menu by going to “Data,” and then to “Variable utilities,” and then clicking “Keep or drop

variable.”

egen mmpg=mean(mpg)

This command allows one to generate a new variable using a special function. Specifically, it generates a variable named “mmpg,” which holds the mean value of variable “mpg” in every observation. You can also use the pull-down menu by going to “Data,” and then to “Create or change variables,” and then clicking “Create new variable (extended).”
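“egen” is most useful together with “by,” which applies the function separately within groups. A sketch, in which the variable name mmpg_f is only an illustration:

```stata
sysuse auto, clear
* mean of mpg within each value of foreign
bysort foreign: egen mmpg_f = mean(mpg)
* check: one constant value per group
tabulate foreign, summarize(mmpg_f)
drop mmpg_f
```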

help egen

This command gives one a detailed description of “egen.” You can also use the pull-down menu by going to “Help,” and then clicking “Search.” In the search window, type “egen.”

drop mmpg

sort price

This sorts the observations in ascending order of “price.” You can also use the pull-down

menu by going to “Data,” then to “Sort,” and then clicking “Ascending sort.”

gsort -price

This sorts the observations in descending order of “price.” You can also use the pull-down

menu by going to “Data,” then to “Sort,” and then clicking “Ascending and descending sort.”

drop if price>=5000

This drops observations if the value of variable “price” is greater than or equal to 5,000. You can also use the pull-down menu by going to “Data,” and then to “Variable utilities,” and then clicking “Keep or drop observations.”

save c:\stata10\auto_below

You can also go to “File” and then “Save as.”

use c:\stata10\auto

You can also go to “File” and then “Open.”

drop if price<5000

save c:\stata10\auto_above

append using c:\stata10\auto_below

This stacks the data in memory (“auto_above”) on top of data “auto_below.” You can also go to “Data,” and then to “Combine datasets,” and then to “Append datasets.”
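The split-and-append sequence can also be run without writing to c:\stata10. The sketch below stores the low-price half in a temporary file (the macro name below is an assumption of this example) and is best run from a do-file; “count” at the end checks that no observation was lost.

```stata
sysuse auto, clear
keep if price < 5000
tempfile below
save `below'

sysuse auto, clear
keep if price >= 5000
append using `below'

* should match the number of cars in the original auto.dta
count
```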

III: How to Tabulate Variables

sysuse dir

Lists the names of data sets shipped with STATA

sysuse auto, clear

Loads auto.dta that is shipped with STATA

The “clear” option clears any file already in the memory

If this does not work, you can search the dataset by clicking “File” and “Example Datasets”

FREQUENCY TABLE

tabulate rep78 (or tab rep78)

Shows the frequency, percentage, and cumulative percentage of each categorical value. You can also use the pull-down menu by clicking “Statistics” and then “Summaries, Tables, and Tests.”

tabulate rep78, plot (or tab rep78, plot)

Shows frequency of each categorical value

The “plot” option plots frequency

CROSS TABLE

tabulate rep78 foreign

Generates a frequency table of both variables “rep78” and “foreign”

tabulate rep78 foreign, column

Generates a frequency table of both variables “rep78” and “foreign” with column percentages, i.e. within each “foreign” category (column), the percent of cars in each “rep78” category

tabulate rep78 foreign, row

Generates a frequency table of both variables “rep78” and “foreign” with row percentages, i.e. within each “rep78” category (row), the percent of cars in each “foreign” category

tabulate rep78 foreign, cell

Generates a frequency table of both variables “rep78” and “foreign” with cell percentages, i.e. the percent each combination of “rep78” and “foreign” is of all observations

tabulate rep78 foreign, column nofreq

Generates a frequency table of both variables “rep78” and “foreign” with column percentages, but suppresses the actual counts

tabulate foreign, summarize(rep78)

Compares the mean, standard deviation, and frequency of “rep78” for each category of “foreign”

tabulate rep78 foreign, summarize(weight)

Compares the mean, standard deviation, and frequency of “weight” for each combination of “rep78” and “foreign”

IV: How to Graph Variables

SCATTER PLOTS

sysuse auto, clear

scatter price weight

This command draws a scatter plot of “price” (vertical axis) against “weight” (horizontal axis). You can also use the pull-down menu by going to “Graphics,” and then to “Twoway graphs.”

sort foreign

Sort observations by the value of “foreign”

scatter weight price, by(foreign)

This command draws a scatter plot between “weight” and “price” for each value of “foreign” in multiple plots. You can also use the “By” tab in the “Twoway graphs” window.

scatter mpg displ

Draws a scatter plot between “mpg” and “displ”

Displ: displacement

graph save c:\stata10\scatter

It saves the scatter plot in a file named scatter.gph. You can also use the pull-down menu by

going to “File” and then to “Save Graph.”

graph use c:\stata10\scatter

It reads the saved scatter plot from the file named scatter.gph. You can also use the pull-down

menu by going to “File” and then to “Open Graph.”

Graph Editor

This allows editing the graph in a variety of ways. Click the “Object Browser.” Try to change the size, the color, and the type of marker. Try to change the x- and y-axis titles. Try to add a title and subtitles. Try to add a marker or a line.

Apply new schemes

Go to “Edit” and then click “Apply New Schemes.” This changes the style of the graph.

Try s1 mono

Try STATA Journal

Try Economist

scatter mpg displ, yline(25)

Overlays a horizontal line at 25. You can do the same by using the Graph Editor.

scatter mpg displ, mlabel(make)

It shows the “make” of each observation.

graph matrix displ weight gear_ratio

Draws a scatter plot matrix

HISTOGRAMS

histogram (or hist) mpg

It draws a histogram of mpg. You can also use the pull-down menu by going to “Graphics,”

and then to “Histogram.”

histogram (or hist) mpg, bin(15)

Draws a histogram of mpg and fixes the number of bins to be 15.

histogram mpg, bin(15) normal

Draws a histogram using 15 bins with an overlaid normal distribution curve

histogram mpg, bin(15) normal by(foreign)

Draws two histograms: one for foreign cars and the other for domestic cars

BAR CHARTS

sysuse citytemp, clear

Loads STATA sample data set named City Temperature Data

graph bar (mean) tempjuly tempjan, over(region)

It draws a bar chart of the mean temperature in July and January by region. You can also use the pull-down menu by going to “Graphics,” and then to “Bar chart.”

LINE CHARTS

sysuse uslifeexp, clear

Loads STATA sample data set named U.S. Life Expectancy Data

line le year

It draws a line chart. You can also use the pull-down menu by going to “Graphics,” and then to

“Twoway graph.”

le: life expectancy

HOW TO COPY AND PASTE GRAPHS

While the graph window is open, go to menu EDIT and click COPY GRAPH

Go to your Word file and PASTE
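As an alternative to copy and paste, the graph currently displayed can be written to an image file and then inserted into Word. A sketch, in which the file name and the PNG format are assumptions of this example:

```stata
sysuse auto, clear
scatter mpg displacement
* write the graph currently in the graph window to a PNG file
graph export c:\stata10\scatter.png, replace
```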

V: Hypotheses Testing

DISTRIBUTIONS

sysuse auto, clear

gen pvalue=normal(-1.645)

It returns the left-tail p-value when the critical value is -1.645 under the standard normal

distribution curve. That is, P[-∞ < Z < -1.645]

list pvalue in 1

gen zvalue=invnormal(0.05)

Returns the left-tail z-value when the p-value is 0.05 under the standard normal distribution

curve. That is, if normal(z) = 0.05 then invnormal(0.05) = z.

list zvalue in 1

gen pvalue2=ttail(30, 1.645)

Returns the right-tail p-value when the critical value is 1.645 under the t-distribution curve with 30 degrees of freedom. That is, P[T > 1.645] when the degrees of freedom are 30

list pvalue2 in 1

gen tvalue=invttail(30, 0.05)

Returns the right-tail t-value when the p-value is 0.05 under the t-distribution curve with 30

degrees of freedom. That is, if ttail(30, t) = 0.05, then invttail(30, 0.05) = t.

list tvalue in 1
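The same four functions can be evaluated without creating variables by using “display,” which prints a single number and does not need a dataset in memory. A minimal sketch:

```stata
* left-tail probability and its inverse under the standard normal
display normal(-1.645)
display invnormal(0.05)
* right-tail probability and its inverse under t with 30 degrees of freedom
display ttail(30, 1.645)
display invttail(30, 0.05)
```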

ONE SAMPLE T-TEST

sysuse auto, clear

ttest mpg=20

One-sample t-test

PAIRED SAMPLE T-TEST

ttest price=length

Paired t-test

TWO SAMPLE T-TEST

ttest price=length, unpair

Unpaired two sample t-test with equal variance

ttest price=length, unpair unequal

Unpaired two sample t-test with unequal variance (by Satterthwaite’s degrees of freedom)

TWO GROUP T-TEST

sort foreign

ttest mpg, by(foreign)

Unpaired two population t-test with equal variance

ttest mpg, by(foreign) unequal

Unpaired two population t-test with unequal variances
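The test statistic and p-value shown in the output are also kept as returned results after “ttest,” so they can be reused. A sketch for the equal-variance test by “foreign”:

```stata
sysuse auto, clear
quietly ttest mpg, by(foreign)
* t statistic, degrees of freedom, and two-sided p-value
display r(t)
display r(df_t)
display r(p)
```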

sysuse census, clear

Loads 1980 census data by state

sort region

ttest medage if region==1 | region==4, by(region)

Unpaired two population t-test with equal variance

sysuse auto, clear

xtile quart=price, nq(4)

Group price into quartiles

tab quart

sort quart

by quart: sum price

ttest weight if quart==1 | quart==4, by(quart)

VI: Regression

CORRELATION

sysuse auto, clear

corr displ weight rep78

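Uses only observations with nonmissing values for all the listed variables; pwcorr below instead uses every available observation for each pair of variables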

pwcorr displ weight rep78

pwcorr displ weight rep78, obs

corr displ weight rep78, covariance

pwcorr displ weight rep78, sig

pwcorr displ weight rep78, sig obs

pwcorr displ weight rep78, star(0.05)

SIMPLE OLS

sysuse auto, clear

correlate weight length (or corr weight length)

Computes a correlation coefficient between variable “weight” and “length”

reg weight length

Runs a regression of “weight” on “length” with a constant. You can also use the pull-down menu by clicking “Statistics” and then “Linear Models and Related,” and then “Linear Regression.”

reg weight length, nocons

Runs the regression without a constant

FITTED LINE IN SCATTER PLOT

reg mpg displ

Regress “mpg” on “displ” and a constant

predict pmpg

Obtain predicted values from the regression

twoway (scatter mpg displ) (line pmpg displ)

Overlay a fitted line over the scatter plot
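The fitted line can also be drawn in one step with the “lfit” plot type, which runs the regression internally, so “predict” is not needed. A sketch of that shortcut:

```stata
sysuse auto, clear
* scatter plot with the OLS fitted line overlaid
twoway (scatter mpg displacement) (lfit mpg displacement)
```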

MULTIVARIATE REGRESSION

reg price mpg rep78 headroom trunk weight

outreg2 using c:\tabel_1, word excel replace

This automatically generates a table in Word and Excel files.

Simply open the file from Word or Excel.

If outreg2 is not installed, go to search and type in “outreg2,” or type “ssc install outreg2” in the command window.

tab rep78, gen(repair)

reg price mpg foreign weight repair1-repair4

outreg2 weight foreign using c:\tabel_1, word excel replace

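The table reports only “weight” and “foreign,” so it does not report results for the dummy variables.

outreg2 using c:\tabel_1, drop(repair*) word excel replace

Also leaves out the dummy variables (but still reports “mpg”).

outreg2 using c:\tabel_1, keep(weight foreign) word excel replace

Same result as listing “weight” and “foreign” before “using.”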

reg price mpg rep78 headroom trunk weight

outreg2 using c:\tabel_1, bdec(2) tstat adjr2 rdec(2) word excel replace

Reports t-statistics (instead of standard errors), adjusted R squared (instead of R squared), coefficients to two decimal places, and adjusted R squared to two decimal places.

reg price mpg rep78 headroom

outreg2 using c:\tabel_1, bdec(2) tstat adjr2 rdec(2) word excel

Without option “replace,” the result will show up in the second column.

CONTROL

reg price foreign

This will show no relationship between ‘foreign’ and ‘price’

But what if foreign cars are lighter and that is pulling down the price? What would be the

relationship between ‘price’ and ‘foreign’ among the cars with similar car weight?

pwcorr foreign weight

sum weight, detail

gen dum_w=1

replace dum_w=2 if weight>=2230

replace dum_w=3 if weight>=3190

replace dum_w=4 if weight>=3600

sort dum_w

by dum_w: reg price foreign

What are the coefficients on foreign?

Last regression is not estimated because there is no variation in the ‘foreign’ variable. That is,

all cars in the category (“heavy cars”) are domestic cars. Recall the formula for coefficient

variance.

reg price foreign weight

What happens to the coefficient on foreign?

The concept of control

reg price foreign

This will show no relationship between ‘foreign’ and ‘price’

But what if foreign cars are shorter and that is pulling down the price? What would be the

relationship between ‘price’ and ‘foreign’ among the cars with similar length?

pwcorr price foreign length

sum length, detail

gen dum_l=1

replace dum_l=2 if length>=170

replace dum_l=3 if length>=193

replace dum_l=4 if length>=204

sort dum_l

by dum_l: reg price foreign

What are the coefficients on foreign?

Last regression is not estimated because there is no variation in the ‘foreign’ variable. That is, all cars in the category (“long cars”) are domestic cars. Recall the formula for coefficient variance.

reg price foreign length

What happens to the coefficient on foreign?

The concept of control

COLLINEARITY

gen domestic=~foreign

Generates a variable named “domestic” that equals 1 for domestic cars and 0 for foreign cars

reg price weight mpg foreign domestic

Case of perfect collinearity

What happens to the variable domestic?

reg price displ

reg price displ weight

Compare the t-values

corr displ weight

Case of multicollinearity

reg price displ weight

vif

Displays VIF and 1/VIF for each right-hand-side variable

VIF = 1/(1 - R^2), where R^2 comes from a regression in which the variable of concern is on the left-hand side and all other independent variables are on the right-hand side (a VIF above 5 is evidence of multicollinearity)
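The formula can be checked by hand: run the auxiliary regression and plug its R squared into 1/(1 - R^2). A sketch for “displacement,” which should reproduce the figure that “vif” reports for that variable:

```stata
sysuse auto, clear
quietly reg price displacement weight
vif
* auxiliary regression: displacement on the other right-hand-side variable
quietly reg displacement weight
display 1/(1 - e(r2))
```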

RIGHT HAND SIDE DUMMY VARIABLES

sysuse auto, clear

xi: reg mpg weight i.rep78

Regression with multiple dummy variables

xi: reg mpg i.foreign|weight

Regression with a “foreign” dummy variable that interacts with a continuous variable

xi: reg mpg i.foreign*weight

Regression with a “foreign” intercept dummy and a “foreign” dummy variable that interacts

with a continuous variable

xi: reg mpg i.foreign*i.rep78

Regression with dummy variables interacting with each other
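Stata 11 also accepts factor-variable notation directly in the regression command, without the “xi:” prefix. The sketch below shows counterparts to three of the regressions above; “##” includes both the main effects and the interaction, and “c.” marks a continuous variable.

```stata
sysuse auto, clear
* multiple dummy variables for rep78
regress mpg weight i.rep78
* foreign dummy, weight, and their interaction
regress mpg i.foreign##c.weight
* two sets of dummies and their interactions
regress mpg i.foreign##i.rep78
```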

VII: More on Regressions

F-TESTS

reg mpg foreign weight length

test length=0

F-test if coefficient on length is zero

Compare p-value from t-test

test weight=length

F-test if weight=length

test (foreign/100)-length=weight

F-test if (foreign/100)-length=weight

test foreign=0

F-test if coefficient on foreign is zero

test foreign=0

test length=0, accumulate

Joint hypotheses testing

F-test if coefficients on “foreign” and “length” are zero

test foreign=0

test length=0, accumulate

test weight=0, accumulate

F-test if coefficients on “foreign,” “length,” and “weight” are zero

Compare it with the overall F-test reported at the top of the regression output

test foreign weight length

Same as above

test

Shows the last test again
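For a single restriction, the F statistic equals the square of the t statistic from the regression table, which is why the two p-values agree. A sketch that checks this for “length”:

```stata
sysuse auto, clear
quietly regress mpg foreign weight length
test length = 0
* F statistic returned by test
display r(F)
* squared t statistic computed from the coefficient and its standard error
display (_b[length]/_se[length])^2
```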

REGRESSION DIAGNOSTICS

reg mpg foreign weight length

avplot weight

Draws an added-variable plot (also known as a partial-regression plot)

Partial regression line: fitted line between two sets of residuals

One residual is estimated by regressing mpg on foreign and length

The other residual is estimated by regressing weight on foreign and length

avplots

Draws added-variable plots for all independent variables in a single graph

reg price displ

reg price displ, beta

Standardized coefficients

reg price weight mpg foreign

predict estu if e(sample), rstudent

Generates studentized residuals

list estu if abs(estu)>1.96

Lists the studentized residuals whose absolute value is greater than 1.96

reg price weight mpg foreign if abs(estu)<1.96

HETEROSCEDASTICITY

reg mpg weight

rvfplot, yline(0)
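The residual-versus-fitted plot is a visual check. As a complement (not part of the original steps), a formal test and heteroscedasticity-robust standard errors might be obtained as sketched below.

```stata
sysuse auto, clear
regress mpg weight
* Breusch-Pagan / Cook-Weisberg test; the null hypothesis is constant variance
estat hettest
* same regression with robust standard errors
regress mpg weight, vce(robust)
```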

SERIAL-CORRELATION

sysuse uslifeexp, clear

Load sample STATA file named US Life Expectancy

reg le_male le_female

rvfplot, yline(0)

Residual-versus-fitted plot
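Beyond the plot, serial correlation can be tested formally once the data are declared as a time series. A sketch, assuming “year” is the time variable in this dataset; these tests are an addition to the steps above.

```stata
sysuse uslifeexp, clear
* declare the data to be a time series indexed by year
tsset year
regress le_male le_female
* Durbin-Watson statistic and Breusch-Godfrey test
estat dwatson
estat bgodfrey
```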

VIII: Using Log Files and Do Files

LOG FILES

sysuse auto, clear

log using c:\stata10\task_1, text

Starts a log file and saves it as a text file (task_1.log).

Without any option, the log is saved in a file with the extension “smcl.” This file can be viewed from STATA by clicking the log icon.

sum rep78, detail

log close

Saves the task of summarizing variable “rep78” in a log file named task_1.log

The log file can be opened in MS Word

sysuse auto, clear

log using c:\stata10\task_1, text replace

Replace the old log file

sum rep78, detail

log off

Temporarily stop the log file

gen w_l=weight*length

log on

Restart the log file

sum w_l, detail

log close

Saves the task of summarizing variables “rep78” and “w_l” in a log file named task_1.log

DO FILES

Open do-file editor

Type in the commands that you want to execute

sysuse auto, clear

log using c:\stata10\task_3, text

reg mpg length weight foreign

sort foreign

by foreign: reg mpg length weight foreign

log close

Save the file with an extension “do.” For example, c:\stata10\task_2.do

do c:\stata10\task_2.do (or run c:\stata10\task_2.do)

Executes the do file

“run” differs in that it does not display the execution on the screen

Use “do” instead of “run” when generating a log file
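A do-file is easier to rerun if it closes any log left open by an earlier attempt and replaces the old log file. The sketch below shows one possible layout for task_2.do; the comment lines and the “capture” line are additions of this example.

```stata
* task_2.do: regress mpg on car characteristics and keep a log
* close a log left open by an earlier run; no error if none is open
capture log close
log using c:\stata10\task_3, text replace

sysuse auto, clear
regress mpg length weight foreign
* separate regressions for domestic and foreign cars
bysort foreign: regress mpg length weight

log close
```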

IX: Other Useful Commands

LARGE FILES

clear

set memory 512000

Sets the amount of memory STATA will use to 512000 kilobytes

set memory 512000, permanently

Specifies that, in addition to making the change right now, Stata will remember the new limit and use it in the future.

set maxvar 8000

Sets the maximum number of variables allowed in a dataset (Stata/SE and Stata/MP only). The maximum number of variables that can be included in Stata's estimation commands is set separately with “set matsize.”

set maxvar 8000, permanently

Specifies that, in addition to making the change right now, Stata will remember the new limit and use it in the future.

query memory

Displays memory settings

MERGE FILES

sysuse auto

keep make price mpg

sort make

Must sort before merging

save c:\stata10\auto_1

browse

sysuse auto

drop price mpg

sort make

Must sort before merging

save c:\stata10\auto_2

merge 1:1 make using c:\stata10\auto_1

1:1 → one-to-one merge

m:1 → many-to-one merge

1:m → one-to-many merge

m:m → many-to-many merge

tab _merge

browse

use c:\stata10\auto_1, clear

drop if make=="Audi Fox"

sort make

save, replace

browse

use c:\stata10\auto_2

drop if make=="BMW 320i"

drop if make=="Buick Opel"

sort make

save, replace

merge 1:1 make using c:\stata10\auto_1

If _merge==1, the original file has the observation, but the merging file does not

If _merge==2, the merging file has the observation, but the original file does not

If _merge==3, both files have the observation

tab _merge

browse
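After inspecting “_merge,” a common next step is to keep only the matched observations and drop the indicator so that a later merge can create it again. A self-contained sketch that uses a temporary file (the macro name prices is an assumption of this example) and is best run from a do-file:

```stata
sysuse auto, clear
keep make price mpg
tempfile prices
save `prices'

sysuse auto, clear
drop price mpg
merge 1:1 make using `prices'
tab _merge

* keep observations found in both files, then remove the indicator
keep if _merge == 3
drop _merge
```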

RESHAPE A FILE

sysuse bplong, clear

reshape wide bp, i(patient) j(when)

Reshapes the file to be wide. That is, the time variable (“when”) is displayed in columns (bp1 and bp2) instead of rows
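The reverse operation uses “reshape long” with the same i() and j() options. A sketch that goes wide and then back to long, listing a few observations at each stage:

```stata
sysuse bplong, clear
reshape wide bp, i(patient) j(when)
* one row per patient
list in 1/5

reshape long bp, i(patient) j(when)
* back to one row per patient and measurement occasion
list in 1/5
```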