How to Use STATA 11.0
This version: 2011-03-22
Spring 2011
(subject to be updated)
Woochan Kim
KDI School of Public Policy and Management
Table of Contents
I: How to Enter Data into STATA
II: How to Describe & Modify Variables
III: How to Tabulate Variables
IV: How to Graph Variables
    SCATTER PLOTS
    HISTOGRAMS
    BAR CHARTS
    LINE CHARTS
    HOW TO COPY AND PASTE GRAPHS
V: Hypotheses Testing
    DISTRIBUTIONS
    ONE SAMPLE T-TEST
    TWO SAMPLE T-TEST
VI: Regression
    CORRELATION
    SIMPLE OLS
    FITTED LINE IN SCATTER PLOT
    MULTIVARIATE REGRESSION
    CONTROL
    COLLINEARITY
    RIGHT HAND SIDE DUMMY VARIABLES
VII: More on Regressions
    F-TESTS
    REGRESSION DIAGNOSTICS
    HETEROSCEDASTICITY
    SERIAL-CORRELATION
VIII: Using Log Files and Do Files
    LOG FILES
    DO FILES
IX: Other Useful Commands
    LARGE FILES
    MERGE FILES
    RESHAPE A FILE
I: How to Enter Data into STATA
Double-click the STATA icon in Windows.
When the STATA program is open, choose the interface preference by going to “Edit” and then
the “Preferences” menu. I personally prefer the simple “Factory Settings.” You can also
relocate the “command,” the “review,” the “variables,” and the “results” windows to your
liking and save your window preference. By going into “General Preferences,” you can also
change the color and the font sizes.
Entering data into STATA
The easiest way to enter data into STATA is making use of copy & paste. If the data set is in an
Excel format, just copy the area of your interest, open the STATA editor (look for the icon under
the menu bar), place the cursor on the cell at the upper left corner, and paste. You can
try this out by downloading the cash.xlsx data from the e-education site. Save the file in
the c:\stata10 directory.
clear
Drops any data in memory
input id mpg weight price
“Input” allows one to enter data using the command window. The specific command above
allows one to input data with variables named “id” “mpg” “weight” “price”
Type in the following data for each observation
1 22 2930 4099
2 17 3350 4749
3 22 2640 3799
One must type in “end” to finish inputting
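Put together, the steps above can be typed as one block. This is a minimal sketch using the three observations from this handout; the final “list” is added only to display what was entered.

```stata
* Enter three observations directly from the command window
clear
input id mpg weight price
1 22 2930 4099
2 17 3350 4749
3 22 2640 3799
end
* Display the data just entered
list
```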
save c:\stata10\auto1
Saves the data in a file titled “auto1” in directory “c:\stata10” (you can save the file in any other
directory of your choice). The data will be saved with the “dta” extension. You can also use
the pull-down menu. Go to “File” and then click “Save.”
edit
This command allows one to input data using the “Editor.” One can also simply click the
“Edit” icon under the menu bar. Type in the following data for each observation
1 22 2930 4099
2 17 3350 4749
3 22 2640 3799
Click “preserve” located at the upper left corner
Click “restore” to undo changes
rename var1 id
Change the name of the variable from “var1” to “id.” You can also use the pull-down menu.
Go to “Data,” and then click “Variable Utilities.”
save c:\stata10\auto1, replace
Replaces the old data titled “auto1” with a new one also titled “auto1”
Open cash.xlsx in Excel and save it as a comma-separated file with the raw extension (cash.raw)
in the c:\stata10 directory.
insheet using c:\stata10\cash
This command imports the comma-separated cash.raw file into STATA (when no extension is
given, insheet assumes “.raw”). You can also use the pull-down menu.
Go to “File,” and then “Import,” and then click “ASCII dataset created by a spreadsheet.”
save c:\stata10\cash
clear
use c:\stata10\cash
This command opens the STATA file. You can also use the pull-down menu. Go to “File” and
then click “Open.”
browse
This command opens the browser window and allows one to directly see the dataset. You can
also simply click the browse icon under the menu bar.
outsheet using c:\stata10\cash
Writes data in an ASCII text file and saves it in a file named c:\stata10\cash.out. You can also
use the pull-down menu. Go to “File” and then “Export” and then click “ASCII Text.” If you
wish to save it as comma-separated data, click “Comma or tab-separated data.”
exit
Exits from STATA
II: How to Describe & Modify Variables
clear
Download sample datasets from STATA
Go to “File” and then to “Example Datasets.” Click “Example datasets installed with Stata.”
Choose “auto.dta” and then click “use” to download.
save c:\stata10\auto
describe (or enter F3)
Shows what is in data c:\stata10\auto.dta. You can see the number of variables & observations.
You can also see the name & description of each variable.
browse
This command opens the browser window and allows one to directly see the dataset. You can
also simply click the browse icon under the menu bar. Inside the browser window, you can
see the icons that let you “sort” observations within a given variable or hide/relocate specific
variables.
edit
This command opens the editor window and allows one to directly edit the dataset. Inside the
editor window, you can see an icon that lets you delete a specific variable. Click “preserve”
located at the upper left corner to keep the changes made and click “restore” to undo changes.
summarize rep78 (or sum rep78)
This command gives summary statistics of “rep78.” It shows the number of observations,
the mean, the standard deviation, the minimum value, and the maximum value (Rep78: Repair
record in 1978).
summarize rep78, detail
This command gives a more detailed description of the variable. It shows the label and the 1st,
the 5th, the 10th, the 25th, the 50th, the 75th, the 90th, the 95th, and the 99th percentile values.
It also gives the four smallest and four largest values, the skewness, and the kurtosis. Kurtosis
measures the degree of peakedness of a distribution or the degree of fat tails of a distribution.
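The statistics reported by “summarize, detail” are also left behind as stored results, so they can be reused in later commands. A minimal sketch, assuming the auto data shipped with STATA:

```stata
sysuse auto, clear
* Run summarize silently, then pull individual stored results
quietly summarize rep78, detail
display "N        = " r(N)
display "median   = " r(p50)
display "skewness = " r(skewness)
display "kurtosis = " r(kurtosis)
```

Typing “return list” right after “summarize” shows every stored result that is available.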
gen mpgsq=mpg^2
This command generates a new variable named “mpgsq,” which is the squared value of mpg
(mpg: miles per gallon). You can also use the pull-down menu by going to “Data,” and then
to “Create or change variables,” and then by clicking “Create new variable.”
rename mpgsq mpgs
This command changes the variable’s name from “mpgsq” to “mpgs.” You can also use the
pull-down menu by going to “Data,” and then to “Variable utilities,” and then clicking
“Rename variable.”
drop mpgs
This command drops the variable “mpgs” from the data set. You can also use the pull-down
menu by going to “Data,” and then to “Variable utilities,” and then clicking “Keep or drop
variable.”
egen mmpg=mean(mpg)
This command allows one to generate a new variable using a special function. Specifically, it
generates a variable named “mmpg,” which takes the mean value of variable “mpg.” You can
also use the pull down menu by going to “Data,” and then to “Create or change variables,” and
then clicking “Create new variable (extended).”
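As an illustration of what “egen” adds over “gen,” the sketch below computes the overall mean and a group mean within each value of “foreign.” The variable names “mmpg_grp” and “mpg_dev” are made up for this example.

```stata
sysuse auto, clear
* Overall mean of mpg, repeated on every observation
egen mmpg = mean(mpg)
* Mean of mpg within each category of foreign
egen mmpg_grp = mean(mpg), by(foreign)
* Deviation of each car from its own group mean
gen mpg_dev = mpg - mmpg_grp
summarize mmpg mmpg_grp mpg_dev
```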
help egen
This command gives one a detailed description of “egen.” You can also use the pull-down
menu by going to “Help,” and then clicking “Search.” In the search window, type “egen.”
drop mmpg
sort price
This sorts the observations in ascending order of “price.” You can also use the pull-down
menu by going to “Data,” then to “Sort,” and then clicking “Ascending sort.”
gsort -price
This sorts the observations in descending order of “price.” You can also use the pull-down
menu by going to “Data,” then to “Sort,” and then clicking “Ascending and descending sort.”
drop if price>=5000
This drops an observation if the value of variable “price” is greater than or equal to 5,000. You can
also use the pull-down menu by going to “Data,” and then to “Variable utilities,” and then
clicking “Keep or drop observations.”
save c:\stata10\auto_below
You can also go to “File” and then “Save as.”
use c:\stata10\auto
You can also go to “File” and then “Open.”
drop if price<5000
save c:\stata10\auto_above
append using c:\stata10\auto_below
It stacks data “auto_above” on top of data “auto_below.” You can also go to “Data,” and then
to “Combine datasets,” and then to “Append datasets.”
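The split-and-append round trip can be checked without writing to c:\stata10. The sketch below is an assumption-laden variant: it uses a temporary file (the macro name “below” is made up) in place of the saved auto_below file.

```stata
sysuse auto, clear
* Keep a copy of the full data, save the cheap cars to a temporary file
tempfile below
preserve
keep if price < 5000
save `below'
restore
* Keep the expensive cars in memory, then stack the cheap ones underneath
drop if price < 5000
append using `below'
* The observation count should match the original data
count
```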
III: How to Tabulate Variables
sysuse dir
List the names of data sets shipped with STATA
sysuse auto, clear
Loads auto.dta that is shipped with STATA
The “clear” option clears any file already in the memory
If this does not work, you can search the dataset by clicking “File” and “Example Datasets”
FREQUENCY TABLE
tabulate rep78 (or tab rep78)
Shows frequency, percentage, cumulative percentage of each categorical value. You can also
use the pull-down menu by clicking “Statistics” and then “Summaries, Tables, and Tests.”
tabulate rep78, plot (or tab rep78, plot)
Shows frequency of each categorical value
The “plot” option plots frequency
CROSS TABLE
tabulate rep78 foreign
Generates a frequency table of both variables “rep78” and “foreign”
tabulate rep78 foreign, column
Generates a frequency table of both variables “rep78” and “foreign” with column percent, i.e.
the percent each “rep78” category is of its column total (within each “foreign” category)
tabulate rep78 foreign, row
Generates a frequency table of both variables “rep78” and “foreign” with row percent, i.e. the
percent each “foreign” category is of its row total (within each “rep78” category)
tabulate rep78 foreign, cell
Generates a frequency table of both variables “rep78” and “foreign” with cell percent, i.e. the
percent each combination of “rep78” and “foreign” is of the overall total
tabulate rep78 foreign, column nofreq
Generates a frequency table of both variables “rep78” and “foreign” with column percent, but
suppresses the actual counts
tabulate foreign, summarize (rep78)
Compares the mean, standard deviation, and frequency of “rep78” for each category of “foreign”
tabulate rep78 foreign, summarize (weight)
Compares the mean, standard deviation, and frequency of “weight” for each category of “rep78”
and “foreign”
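The percent options above can be combined in one table, and observations with a missing “rep78” can be shown as their own category. A short sketch; the “missing” option is not used elsewhere in this handout.

```stata
sysuse auto, clear
* Row and column percentages in the same cross table
tabulate rep78 foreign, row column
* Treat missing values of rep78 as a separate category
tabulate rep78 foreign, missing
```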
IV: How to Graph Variables
SCATTER PLOTS
sysuse auto, clear
scatter price weight
This command draws a scatter plot between variable “weight” and “price.” You can also use
the pull-down menu by going to “Graphics,” and then to “Twoway graphs.”
sort foreign
Sort observations by the value of “foreign”
scatter weight price, by(foreign )
This command draws a scatter plot between “weight” and “price” for each value of “foreign” in
multiple plots. You can also use the “By” tab in the “Twoway graphs” window.
scatter mpg displ
Draws a scatter plot between “mpg” and “displ”
Displ: displacement
graph save c:\stata10\scatter
It saves the scatter plot in a file named scatter.gph. You can also use the pull-down menu by
going to “File” and then to “Save Graph.”
graph use c:\stata10\scatter
It reads the saved scatter plot from the file named scatter.gph. You can also use the pull-down
menu by going to “File” and then to “Open Graph.”
Graph Editor
This allows editing the graph in a variety of ways. Click the “Object Browser.” Try to change
the size, the color, and the type of marker. Try to change the x- and y-axis titles. Try to add
the title and subtitles. Try to add a marker or a line.
Apply new schemes
Go to “Edit” and then click “Apply New Schemes.” This changes the style of the graph.
Try s1 mono
Try STATA Journal
Try Economist
scatter mpg displ, yline(25)
Overlays a horizontal line at 25. You can do the same by using the Graph Editor.
scatter mpg displ, mlabel(make)
It shows the “make” of each observation.
graph matrix displ weight gear_ratio
Draws a scatter plot matrix
HISTOGRAMS
histogram (or hist) mpg
It draws a histogram of mpg. You can also use the pull-down menu by going to “Graphics,”
and then to “Histogram.”
histogram (or hist) mpg, bin(15)
Draws a histogram of mpg and fixes the number of bins to be 15.
histogram mpg, bin(15) normal
Draws a histogram using 15 bins with an overlaid normal distribution curve
histogram mpg, bin(15) normal by(foreign)
Draws two histograms: one for foreign cars and the other for domestic cars
BAR CHARTS
sysuse citytemp, clear
Loads STATA sample data set named City Temperature Data
graph bar (mean) tempjuly tempjan, over(region)
It draws a bar chart of temperature in July and January by region. You can also use the
pull-down menu by going to “Graphics,” and then to “Bar chart.”
LINE CHARTS
sysuse uslifeexp, clear
Loads STATA sample data set named U.S. Life Expectancy Data
line le year
It draws a line chart. You can also use the pull-down menu by going to “Graphics,” and then to
“Twoway graph.”
le: life expectancy
HOW TO COPY AND PASTE GRAPHS
While the graph window is open, go to menu EDIT and click COPY GRAPH
Go to your Word file and PASTE
V: Hypotheses Testing
DISTRIBUTIONS
sysuse auto, clear
gen pvalue=normal(-1.645)
It returns the left-tail p-value when the critical value is -1.645 under the standard normal
distribution curve. That is, P[-∞ < Z < -1.645]
list pvalue in 1
gen zvalue=invnormal(0.05)
Returns the left-tail z-value when the p-value is 0.05 under the standard normal distribution
curve. That is, if normal(z) = 0.05 then invnormal(0.05) = z.
list zvalue in 1
gen pvalue2=ttail(30, 1.645)
Returns the right-tail p-value when the critical value is 1.645 under the t-distribution curve with
30 degrees of freedom. That is, P[T > 1.645] with 30 degrees of freedom
list pvalue2 in 1
gen tvalue=invttail(30, 0.05)
Returns the right-tail t-value when the p-value is 0.05 under the t-distribution curve with 30
degrees of freedom. That is, if ttail(30, t) = 0.05, then invttail(30, 0.05) = t.
list tvalue in 1
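The same four functions can be evaluated with “display,” which works like a calculator and does not create a variable. A minimal sketch:

```stata
* Left-tail probability under the standard normal
display normal(-1.645)
* z-value that leaves 0.05 in the left tail
display invnormal(0.05)
* Right-tail probability under the t distribution with 30 degrees of freedom
display ttail(30, 1.645)
* t-value that leaves 0.05 in the right tail with 30 degrees of freedom
display invttail(30, 0.05)
```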
ONE SAMPLE T-TEST
sysuse auto, clear
ttest mpg=20
One-sample t-test
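To see what the one-sample test computes, the t-statistic can be rebuilt from the sample mean, standard deviation, and number of observations. This is a sketch for illustration; the scalar name “t_manual” is made up.

```stata
sysuse auto, clear
* t = (sample mean - hypothesized mean) / (sd / sqrt(N))
quietly summarize mpg
scalar t_manual = (r(mean) - 20) / (r(sd) / sqrt(r(N)))
display "manual t = " t_manual
* Same statistic as reported by ttest
quietly ttest mpg == 20
display "ttest  t = " r(t)
```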
PAIRED SAMPLE T-TEST
ttest price=length
Paired t-test
TWO SAMPLE T-TEST
ttest price=length, unpair
Unpaired two sample t-test with equal variance
ttest price=length, unpair unequal
Unpaired two sample t-test with unequal variance (by Satterthwaite’s degrees of freedom)
TWO GROUP T-TEST
sort foreign
ttest mpg, by(foreign)
Unpaired two population t-test with equal variance
ttest mpg, by(foreign) unequal
Unpaired two population t-test with unequal variances
sysuse census, clear
Loads 1980 census data by state
sort region
ttest medage if region==1 | region==4, by(region)
Unpaired two population t-test with equal variance
sysuse auto, clear
xtile quart=price, nq(4)
Group price into quartiles
tab quart
sort quart
by quart: sum price
ttest weight if quart==1 | quart==4, by(quart)
VI: Regression
CORRELATION
sysuse auto, clear
corr displ weight rep78
Uses observations that exist for all the variables
pwcorr displ weight rep78
pwcorr displ weight rep78, obs
corr displ weight rep78, covariance
pwcorr displ weight rep78, sig
pwcorr displ weight rep78, sig obs
pwcorr displ weight rep78, star(0.05)
SIMPLE OLS
sysuse auto, clear
correlate weight length (or corr weight length)
Computes a correlation coefficient between variable “weight” and “length”
reg weight length
It runs a regression with a constant. You can also use the pull-down menu by clicking “Statistics”
and then “Linear Models and Related,” and then “Linear Regression.”
reg weight length, nocons
Runs a regression without a constant
FITTED LINE IN SCATTER PLOT
reg mpg displ
Regress “mpg” on “displ” and a constant
predict pmpg
Obtain predicted values from the regression
twoway (scatter mpg displ) (line pmpg displ)
Overlay a fitted line over the scatter plot
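The values created by “predict” are simply the estimated constant plus the slope times “displ.” A minimal sketch that rebuilds them by hand from the stored coefficients; “pmpg_manual” is a made-up name.

```stata
sysuse auto, clear
quietly regress mpg displacement
* Fitted values from predict
predict pmpg
* Same fitted values from the stored coefficients
gen pmpg_manual = _b[_cons] + _b[displacement]*displacement
* Both versions should agree
summarize pmpg pmpg_manual
```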
MULTIVARIATE REGRESSION
reg price mpg rep78 headroom trunk weight
outreg2 using c:\tabel_1, word excel replace
This automatically generates a table in word/excel files.
Simply open the file in Word or Excel.
If outreg2 is not installed, type “ssc install outreg2” in the command window.
tab rep78, gen(repair)
reg price mpg foreign weight repair1-repair4
outreg2 weight foreign using c:\tabel_1, word excel replace
The table does not report results for dummy variables.
outreg2 using c:\tabel_1, drop(repair*) word excel replace
Same result.
outreg2 using c:\tabel_1, keep(weight foreign) word excel replace
Same result.
reg price mpg rep78 headroom trunk weight
outreg2 using c:\tabel_1, bdec(2) tstat adjr2 rdec(2) word excel replace
Reports t-stat (instead of standard error), adjusted R squared (instead of R squared), coefficients
in two decimals, and adjusted R squared in two decimals.
reg price mpg rep78 headroom
outreg2 using c:\tabel_1, bdec(2) tstat adjr2 rdec(2) word excel
Without option “replace,” the result will show up in the second column.
CONTROL
reg price foreign
This will show no relationship between ‘foreign’ and ‘price’
But what if foreign cars are lighter and that is pulling down the price? What would be the
relationship between ‘price’ and ‘foreign’ among the cars with similar car weight?
pwcorr foreign weight
sum weight, detail
gen dum_w=1
replace dum_w=2 if weight>=2230
replace dum_w=3 if weight>=3190
replace dum_w=4 if weight>=3600
sort dum_w
by dum_w: reg price foreign
What are the coefficients on foreign?
Last regression is not estimated because there is no variation in the ‘foreign’ variable. That is,
all cars in the category (“heavy cars”) are domestic cars. Recall the formula for coefficient
variance.
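The formula being recalled is presumably the sampling variance of the slope in a simple regression of y on x. That is an assumption, since the handout does not state it.

```latex
\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

When “foreign” takes the same value for every car in a group, each term in the denominator is zero, so the slope cannot be estimated.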
reg price foreign weight
What happens to the coefficient on foreign?
The concept of control
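The effect of controlling for weight can be seen by putting the two coefficients on “foreign” side by side. A minimal sketch; the scalar names are made up for this example.

```stata
sysuse auto, clear
* Without control: raw price difference between foreign and domestic cars
quietly regress price foreign
scalar b_simple = _b[foreign]
* With control: price difference among cars of similar weight
quietly regress price foreign weight
scalar b_ctrl = _b[foreign]
display "foreign, no control:          " b_simple
display "foreign, controlling weight:  " b_ctrl
```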
reg price foreign
This will show no relationship between ‘foreign’ and ‘price’
But what if foreign cars are shorter and that is pulling down the price? What would be the
relationship between ‘price’ and ‘foreign’ among the cars with similar length?
pwcorr price foreign length
sum length, detail
gen dum_l=1
replace dum_l=2 if length>=170
replace dum_l=3 if length>=193
replace dum_l=4 if length>=204
sort dum_l
by dum_l: reg price foreign
What are the coefficients on foreign?
Last regression is not estimated because there is no variation in the ‘foreign’ variable. That is,
all cars in the category (“long cars”) are domestic cars. Recall the formula for coefficient
variance.
reg price foreign length
What happens to the coefficient on foreign?
The concept of control
COLLINEARITY
gen domestic=~foreign
Generate a variable named domestic
reg price weight mpg foreign domestic
Case of perfect collinearity
What happens to the variable domestic?
reg price displ
reg price displ weight
Compare the t-values
corr displ weight
Case of multicollinearity
reg price displ weight
vif
Displays VIF and 1/VIF for each right-hand-side variable
VIF = 1/(1-R²) from a regression where the variable of concern is on the left-hand-side and all
other independent variables are on the right-hand-side (if above 5, evidence of multicollinearity)
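The VIF formula can be checked by running the auxiliary regression by hand. A sketch under the assumption that “estat vif” is used as the current name of the “vif” command; “vif_displ” is a made-up scalar name.

```stata
sysuse auto, clear
* Auxiliary regression: displacement on the other right-hand-side variable
quietly regress displacement weight
scalar vif_displ = 1/(1 - e(r2))
display "manual VIF for displacement = " vif_displ
* VIF as reported after the main regression
quietly regress price displacement weight
estat vif
```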
RIGHT HAND SIDE DUMMY VARIABLES
sysuse auto, clear
xi: reg mpg weight i.rep78
Regression with multiple dummy variables
xi: reg mpg i.foreign|weight
Regression with a “foreign” dummy variable that interacts with a continuous variable
xi: reg mpg i.foreign*weight
Regression with a “foreign” intercept dummy and a “foreign” dummy variable that interacts
with a continuous variable
xi: reg mpg i.foreign*i.rep78
Regression with dummy variables interacting with each other
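Assuming STATA 11 or later, the same models can also be written with factor-variable notation, which does not need the “xi:” prefix. This is offered as an alternative sketch, not as the handout's method.

```stata
sysuse auto, clear
* Dummies for each category of rep78 (first category is the base)
regress mpg weight i.rep78
* Foreign intercept dummy, weight, and their interaction
* (c. marks weight as continuous, ## adds main effects and interaction)
regress mpg i.foreign##c.weight
```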
VII: More on Regressions
F-TESTS
reg mpg foreign weight length
test length=0
F-test if coefficient on length is zero
Compare p-value from t-test
test weight=length
F-test if weight=length
test (foreign/100)-length=weight
F-test if (foreign/100)-length=weight
test foreign=0
F-test if coefficient on foreign is zero
test foreign=0
test length=0, accumulate
Joint hypotheses testing
F-test if coefficients on “foreign” and “length” are zero
test foreign=0
test length=0, accumulate
test weight=0, accumulate
F-test if coefficients on “foreign,” “length,” and “weight” are zero
Compare it with goodness of fit test
test foreign weight length
Same as above
test
Shows the last test again
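Two of the comparisons suggested above can be made explicit: the F-statistic for a single coefficient equals its squared t-statistic, and the joint test of all slopes equals the F-statistic in the regression header. A minimal sketch; “f_single” is a made-up scalar name.

```stata
sysuse auto, clear
quietly regress mpg foreign weight length
* Single restriction: F equals t squared
quietly test length = 0
scalar f_single = r(F)
display "F = " f_single "   t^2 = " (_b[length]/_se[length])^2
* All slopes jointly zero: F equals the regression's overall F
quietly test foreign weight length
display "joint F = " r(F) "   regression F = " e(F)
```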
REGRESSION DIAGNOSTICS
reg mpg foreign weight length
avplot weight
Draws an added-variable plot (also known as a partial regression plot)
Partial regression line: fitted line between two residuals
One residual is estimated by regressing mpg on foreign and length
The other residual is estimated by regressing weight on foreign and length
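The two residuals described above can be built by hand. In the sketch below (variable and scalar names are made up), the slope from regressing one residual on the other matches the coefficient on weight in the full regression.

```stata
sysuse auto, clear
* Residual of mpg after removing foreign and length
quietly regress mpg foreign length
predict double e_mpg, residuals
* Residual of weight after removing foreign and length
quietly regress weight foreign length
predict double e_weight, residuals
* Slope of the partial regression line
quietly regress e_mpg e_weight
scalar b_partial = _b[e_weight]
* Compare with the coefficient on weight in the full regression
quietly regress mpg foreign weight length
display "partial slope = " b_partial "   full regression = " _b[weight]
```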
avplots
Draws added-variable plots for all independent variables in a single graph
reg price displ
reg price displ, beta
Standardized coefficients
reg price weight mpg foreign
predict estu if e(sample), rstudent
Generates studentized residuals
list estu if abs(estu)>1.96
Shows the studentized residuals whose absolute value is greater than 1.96
reg price weight mpg foreign if abs(estu)<1.96
HETEROSCEDASTICITY
reg mpg weight
rvfplot, yline(0)
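Besides inspecting the plot, a formal test and robust standard errors are available. The two commands below are not part of this handout and are shown only as a possible next step.

```stata
sysuse auto, clear
quietly regress mpg weight
* Breusch-Pagan / Cook-Weisberg test for heteroscedasticity
estat hettest
* Same regression with heteroscedasticity-robust standard errors
regress mpg weight, vce(robust)
```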
SERIAL-CORRELATION
sysuse uslifeexp, clear
Load sample STATA file named US Life Expectancy
reg le_male le_female
rvfplot, yline(0)
Residual-versus-fitted plot
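A numerical check for serial correlation is the Durbin-Watson statistic. It is not covered in this handout, and the sketch assumes the data are first declared as a time series with “tsset.”

```stata
sysuse uslifeexp, clear
* Declare year as the time variable
tsset year
quietly regress le_male le_female
* Durbin-Watson d statistic (values far below 2 suggest positive serial correlation)
estat dwatson
```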
VIII: Using Log Files and Do Files
LOG FILES
sysuse auto, clear
log using c:\stata10\task_1, text
Start a log file and save it in a text file.
Without any option, save in a file with an extension of “smcl.” This file can be viewed from
STATA by clicking the log icon.
sum rep78, detail
log close
Saves the task of summarizing variable “rep78” in a log file named task_1.log
Can open the log file using MS-Word
sysuse auto, clear
log using c:\stata10\task_1, text replace
Replace the old log file
sum rep78, detail
log off
Temporarily stop the log file
gen w_l=weight*length
log on
Restart the log file
sum w_l, detail
log close
Saves the task of summarizing variable “rep78” and “w_l” in a log file named task_1.log
DO FILES
Open do-file editor
Type in the commands that you want to execute
sysuse auto, clear
log using c:\stata10\task_3, text
reg mpg length weight foreign
sort foreign
by foreign: reg mpg length weight foreign
log close
Save the file with an extension “do.” For example, c:\stata10\task_2.do
do c:\stata10\task_2.do (or run c:\stata10\task_2.do)
Executes the do file
“Run” is different in that it does not display the execution on the screen
Use “do” instead of “run” when generating a log file
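A do file that opens a log fails on a second run if the earlier log is still open or the log file already exists. The skeleton below is one common way around this; the file name “task_demo” is hypothetical and the log is written to the current working directory.

```stata
* task_demo.do (hypothetical name): a minimal do-file skeleton
* Close any log left open by an earlier run; ignore the error if none is open
capture log close
* Overwrite the old log file instead of stopping with an error
log using task_demo, text replace
sysuse auto, clear
regress mpg length weight foreign
log close
```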
IX: Other Useful Commands
LARGE FILES
clear
set memory 512000
Sets the amount of memory STATA will use to be 512000 kilobytes
set memory 512000, permanently
Specifies that, in addition to making the change right now, Stata will remember the new limit and use it in the future.
set maxvar 8000
Sets the maximum number of variables allowed in a dataset (the limit on variables in estimation commands is set separately with “set matsize”)
set maxvar 8000, permanently
Specifies that, in addition to making the change right now, Stata will remember the new limit and use it in the future.
query memory
Displays memory settings
MERGE FILES
sysuse auto
keep make price mpg
sort make
Must sort before merging
save c:\stata10\auto_1
browse
sysuse auto
drop price mpg
sort make
Must sort before merging
save c:\stata10\auto_2
merge 1:1 make using c:\stata10\auto_1
1:1 → one-to-one merge
m:1 → many-to-one merge
1:m → one-to-many merge
m:m → many-to-many merge
tab _merge
browse
use c:\stata10\auto_1
drop if make=="Audi Fox"
sort make
save, replace
browse
use c:\stata10\auto_2
drop if make=="BMW 320i"
drop if make=="Buick Opel"
sort make
save, replace
merge 1:1 make using c:\stata10\auto_1
If _merge==1, the original file has the observation, but the merging file does not
If _merge==2, the merging file has the observation, but the original file does not
If _merge==3, both files have the observation
tab _merge
browse
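The unmatched-merge exercise can be reproduced without touching the saved files. The sketch below uses a temporary file in place of c:\stata10\auto_1; the macro name “part1” is made up for this example.

```stata
* Using file: make, price, mpg, without the Audi Fox
sysuse auto, clear
keep make price mpg
drop if make == "Audi Fox"
tempfile part1
save `part1'
* Master file: everything else, without two other cars
sysuse auto, clear
drop price mpg
drop if make == "BMW 320i" | make == "Buick Opel"
* One-to-one merge on make, then inspect where each observation came from
merge 1:1 make using `part1'
tabulate _merge
```

Observations with _merge equal to 1 or 2 are the cars that were dropped from one file but not the other.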
RESHAPE A FILE
sysuse bplong
reshape wide bp, i(patient) j(when)
Reshapes the file to be wide. That is, the values of the time variable “when” are displayed in columns (bp1 and bp2), with one row per patient
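A wide file can be turned back into a long one with the same options. A minimal sketch of the round trip:

```stata
sysuse bplong, clear
* Long to wide: one row per patient, bp split into bp1 and bp2
reshape wide bp, i(patient) j(when)
describe
* Wide back to long: one row per patient and time
reshape long bp, i(patient) j(when)
describe
```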