Introduction to Stata

STATA9 Instructor : Samaa H. Hosny Ph.D. Candidate Sunday-Wednesday 10-14 May 2009 Faculty of Economics and Political Science, Cairo University


This is a short introductory course on the Stata statistical software, version 9; the material still applies to later versions of Stata. The course ran for 9 hours and was given at the Faculty of Economics and Political Science, Cairo University.

Transcript of Introduction to Stata

Page 1: Introduction to Stata

STATA9

Instructor:

Samaa H. Hosny, Ph.D. Candidate

Sunday-Wednesday

10-14 May 2009

Faculty of Economics and Political Science, Cairo University

Page 2: Introduction to Stata

Section 1:

Introduction and Overview

Page 3: Introduction to Stata

1. Stata interface:

- windows,

- icons vs. syntax, and

- initial output

2. Stata community and website: www.stata.com

Page 4: Introduction to Stata

3. Compare Stata with SPSS

In terms of statistical capabilities, Stata can do a lot more than SPSS, i.e. it offers more advanced procedures.

SPSS is more inclined towards the business world.

Stata is more inclined towards the research community. It offers a helpful exchange of ideas and experience between its academic users.

Page 5: Introduction to Stata

3. Compare Stata with SPSS (cont’d)

Stata can receive updates as well as ado files, i.e. you don’t need to wait for a new version to run new commands.

Compare the two websites: www.spss.com and www.stata.com

Join the mailing list of updates: send a message to [email protected] and write:

subscribe statalist email@address

OR for daily summary you can write:

subscribe statalist-digest email@address

Page 6: Introduction to Stata

4. Introducing the three Stata editions:

- Stata/SE: special edition

- Stata/IC: Intercooled

- Stata/small: small (educational)

5. Two dimensions of data: cases and variables

Page 7: Introduction to Stata

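                                              Stata/SE         Stata/IC         Stata/small
-------------------------------------------------------------------------------------------
Max. no. of variables                         32,766           2,047            99
Max. no. of observations                      2,147,483,647    2,147,483,647    1,000
Max. no. of characters for a string variable  244              80               80
Matrices                                      1,000 x 1,000    800 x 800        40 x 40
-------------------------------------------------------------------------------------------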

Page 8: Introduction to Stata

6. Help: Built-in (offline) / internet (online)

- From Stata: Help >> Search >> Search all

- findit keyword

- Very helpful links at:

http://www.stata.com/links/resources1.html

7. File Types:

- Data file: filename.dta

- Do file: filename.do

- Ado file: filename.ado

- Log file (only readable in Stata): filename.smcl

- Log file (text file): filename.log

Page 9: Introduction to Stata

8. Main tasks

1. Accessing data

2. Entering data in Stata

3. Converting data (via Stat/Transfer) from formats such as text, Excel, SPSS, SAS, and other software

4. Saving data in Stata format as a .dta file

5. File and data preparation

6. Descriptive statistics

7. Tabulating data: frequency tables and cross tabulations

8. Graphics

9. Data analysis

10. Preparing a report using Stata output (creating a Word document)

Page 10: Introduction to Stata

9. Good Practices

1. Documentation within a program (using the *)

2. Intuitive variable names, labels, and file names

3. Avoid destroying or over-writing original data

4. Appropriate use of command abbreviation

5. Keeping a record of work: commands (do) and outputs (log)
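The practices above can be combined in one short do-file. The following is only a minimal sketch: the file names analysis.log and auto_work.dta and the variable price_k are hypothetical, and it assumes the example dataset auto that ships with Stata.

```stata
* analysis.do : hypothetical example of a documented do-file
version 9                        // interpret the commands under Stata 9 rules
capture log close                // close any log left open by an earlier run
set more off
log using analysis.log, replace  // text log: keeps a record of commands and output

sysuse auto, clear               // example dataset shipped with Stata

* intuitive variable name and label
generate price_k = price/1000
label variable price_k "Price (thousands of dollars)"
summarize price_k                // "su price_k" is the shortest abbreviation

* never overwrite the original data: save the result under a new name
save auto_work, replace

log close
```

Running the file with "do analysis" repeats the whole session and leaves the commands (in the do-file) and the outputs (in the log) on disk.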

Page 11: Introduction to Stata

Section 2:

Getting Started

(Hands-On Applications)

Page 12: Introduction to Stata

1. Opening and closing Stata

Like any other software:

Start button >> Programs >> Stata

OR

Double click on the software icon on the desktop

Page 13: Introduction to Stata

2. Memory issues: checking and setting memory capacity

To check memory status (default is 1m = 1 megabyte)

memory

To change the memory (needed for large data sets)

set memory 750m //for the current session only

set memory 750m, perm //kept as the setting for future sessions as well

Page 14: Introduction to Stata

3. Current directory and changing it

The current directory is shown as a path in the status bar below the four windows. To use and save files without retyping the full path in every “use” or “save” command, we change the current directory to the folder we want to work in, as follows:

cd D:\Stata\Data //no quotes needed since the path contains no spaces

OR

cd "D:\Stata\Data Files" //a space in the path name requires using quotes

Page 15: Introduction to Stata

4. Opening and closing files (data, log, and do)

Open file existing in any directory:

use C:\folder\filename.dta

Open file existing on the current directory:

use filename

Open specific variables :

use age region using C:\folder\filename.dta

Open specific cases :

use C:\folder\filename.dta if male==1 //using a certain category

use C:\folder\filename.dta in 1/10 //using the first 10 cases only

Page 16: Introduction to Stata

5. Preparing a report using Stata output: (creating a Word document)

Steps:

1. Open a log file to save all contents of the session (commands and outputs) using:

log using filename.log //to have a text file not an .smcl file

2. Carry out all the analysis required, then write:

log close

3. Open this file using any .txt reader and copy it to a Word file

4. Format the Word file with the font “Courier New” at size 8 or 9, so that the output keeps exactly the same alignment as in the Stata output window

Page 17: Introduction to Stata

6. Viewing existing data in a data file

To view all existing variable names, specification, labels, number of variables and observations we use:

describe

d, fullnames //to avoid abbreviations in names

To describe only some of them, selected by name, we use:

describe region1 region2

To select all those starting with the same letters (e.g. reg) we use:

d reg* // describe all variables starting with reg-

d *tion // describe all variables ending with -tion

Page 18: Introduction to Stata

6. Viewing existing data in a data file (cont’d)

To view all existing observations:

list

To view some selected observations:

list in 1/5 //1st 5 observations

To view some selected variables:

l age gov sex in 1/10 //1st 10 observations in the three variables

l age gov sex if male==1 //only males in the three variables

li X* //all variables starting with X

Page 19: Introduction to Stata

6. Viewing existing data in a data file (cont’d)

Important Notes!

To stop a command that is producing very long output, for example a listing of all the observations in a very large dataset, we can use the Break icon in the toolbar or, from the keyboard, Ctrl+Break (on Windows) at any time to stop getting more output from the same command.

To switch off the -more- pause between pages of output for the current session we type:

set more off

To make this setting permanent we type:

set more off, permanently

Page 20: Introduction to Stata

7. Entering and saving data

1. Manually through the keyboard: a string variable is declared by writing str and its length before the varname (e.g. var3 is a string of up to 9 characters, so it is str9):

input var1 var2 str9 var3

val11 val12 "val13"

.

.

.

valN1 valN2 "valN3"

end

Page 21: Introduction to Stata

7. Entering and saving data

2. Manually through the data editor

Enter values in the table cell by cell (where the cursor, i.e. the colored cell, is).

Double click on the varname and edit its name, label, and format.

Page 22: Introduction to Stata

7. Entering and saving data

3. Download or search for datasets by:

- Typing in the command window:

help datasets

- Searching www.stata.com for datasets

- Searching the internet

Page 23: Introduction to Stata

7. Entering and saving data

4. Using Stat/Transfer to convert a file from another format (e.g. a spreadsheet) into Stata format. This is the best way to avoid losing any data, and it also maintains all variable labels and storage types (in case the file came from SPSS or any other statistical package that saves information about the variables).

Page 24: Introduction to Stata

7. Entering and saving data

5. Save an Excel file with variable header (i.e. varnames in the first row) >> select all >> copy from Excel sheet >> highlight the upper left cell in Stata data editor >> paste

6. Save an Excel file using tab delimited format (.txt) without variable headers (i.e. all columns are values)

Then type in Stata command window:

insheet using Book1.txt

Page 25: Introduction to Stata

7. Entering and saving data (Precautions)

- Check for any data that might have been lost while transferring to Stata without Stat/Transfer

- Make sure you label the variables and rename them in Stata after the insheet

- Also check the Stata infile command

- Note that Stata 10 reads directly from Excel by using the file icon in the Stata interface.

Page 26: Introduction to Stata

8. Labeling data, variables, and values

label data "This is Employment data"

label variable employ "Employment Status"

label define employed 0 "unemployed" 1 "employed"

label values employ employed

label define employed 2 "Other", add

OR

label define employed 0 "0: No" 1 "1: Yes" 2 "2: Other", modify
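To see how the three kinds of labels fit together, the commands can be tried on a few typed-in cases. This is only a sketch: the four data values are made up for illustration, while the names employ and employed follow the slide.

```stata
clear
input employ
0
1
1
2
end

label variable employ "Employment Status"                    // label of the variable
label define employed 0 "unemployed" 1 "employed" 2 "Other"  // a named set of value labels
label values employ employed                                 // attach the set to the variable

label list employed       // codes and labels stored under the name employed
tabulate employ           // frequency table showing the labels
tabulate employ, nolabel  // the same table showing the codes
```

The label name (employed) and the variable name (employ) are separate objects, so one set of value labels can be attached to several variables.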

Page 27: Introduction to Stata

9. Describing and tabulating data

1. An overview of data

• The first step is to see the data (variables and observations) by the ‘list’ and ‘describe’ commands.

• See the full set of value labels stored under a label name

label list name

NB! Here we type the name of the value label (the one given in label define), NOT the varname

Page 28: Introduction to Stata

9. Describing and tabulating data

2. Summarizing data (descriptive statistics):

For quantitative data (numeric variables only)

summarize

To show basic descriptives of var X, i.e. no. of obs., mean, st. dev., min, & max values

sum X

To show detailed descriptives of var X: basic + percentiles, variance, and skewness

sum X, detail

Page 29: Introduction to Stata

9. Describing and tabulating data

3. Frequency tables

tabulate X

ta X, nolabel //shows codes NOT labels

tab1 X1 X2 X3 X4 //for each one separately

ta X, summarize(Y) //summarizes Y for each category of X

Page 30: Introduction to Stata

9. Describing and tabulating data

4. Cross tabulations

Can take up to 2 variables: Y on rows, X on columns with totals:

ta Y X

ta X1 X2, row //displays row percentages (% within each category of X1)

ta X1 X2, row nofreq //displays row percentages without frequencies

bysort X: tab Y, summarize(Z) missing //for each categ. of X, we tabulate Y (including its missing categ.) and calculate basic descriptives of Z

Page 31: Introduction to Stata

9. Describing and tabulating data

4. Crosstabs (cont’d)

Another command, table, is more flexible in its options, esp. weights.

Can take up to 3 variables, with Y as the rows and X1 as the columns, repeated for each category of X2.

table Y X1 X2, row col //a new row and col. for totals

table Y X, by(Z) //a separate table for Y on rows and X on columns for each category of Z

Page 32: Introduction to Stata

10. Data manipulation

1. Creating case number (case id)

generate id=_n

2. Deleting existing variables/cases

drop X //deletes variable X

keep X Y Z //deletes all other variables

drop if gov==1 //deletes all cases in this governorate

drop if ~male //deletes females (male==0); cases with missing male are kept, since missing counts as true

keep if age>=15 & age<=60 //deletes all other cases

Page 33: Introduction to Stata

10. Data manipulation

3. Dealing with Variable Groups:

Grouping variables in a variable set

global set1 "X1 X2 X3 X4"

When we use this variable set in any command, we call it by adding a $ before the name. For example:

tab1 $set1

Page 34: Introduction to Stata

10. Data manipulation

3. Dealing with Variable Groups:

The use of dash (-)

for var X1-Y10: rename X X_2009 //will be executed on variables X1 to Y10 (in dataset order); X stands for each variable in turn

The use of star (*): (previously discussed)

describe X*

des demo*99

list *Y*

list *W

Page 35: Introduction to Stata

10. Data manipulation

4. Creating new variables (gen)

generate y=1

g z=1 if (x==5) //== tests equality; a single = is only for assignment

gen samplesize=_N //column of a constant = total number of observations in the dataset (total sample size)

bysort family: gen famsize=_N //column of constants = total number of observations in each family (family sizes add up to the sample size)

Page 36: Introduction to Stata

10. Data manipulation

Creating new variables (gen) (cont’d)

gen l_income=log(income) //natural log

OR

gen l_income=ln(income) //natural log

gen loginc=log10(income) //base 10 log

gen Y=sqrt(X) //get square-root of X

gen Z=exp(Y) //get the exponential

gen sqage=age^2 //get the square age

gen XY=X*Y //interaction term

gen lagYt = Yt[_n-1] //lagYt=Yt-1

Page 37: Introduction to Stata

10. Data manipulation

Creating new variables (egen and its options)

egen avage = mean(age) //mean age of sample (the same value for every case)

bysort hh: egen avg = mean(age) //mean age for every hh

egen meddiff = median(var1-var2) //(exp, - means subtraction) median of the difference

egen avginc = rowmean(W X Y Z)

OR

egen avginc = rowmean(W-Z) //(varlist, - means through)

egen ttlsales = total(sales), by(region)

Page 38: Introduction to Stata

10. Data manipulation

Dummy variable construction:

Manual (allowing missing values)

gen female=1 if sex==2

replace female=0 if sex==1

Automatic (NOT allowing missing values)

gen married=(mrtst==2) //generate a dummy for married

tab region, gen(region) //generate 6 dummy variables for the 6 regions

xi commands for categorical data

xi: tab1 i.region //can be used with any command. Dummies not saved
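The difference between the manual and the automatic construction shows up only when the source variable has missing values. A minimal sketch, using three made-up cases of the variable mrtst (the names married1 and married2 are hypothetical):

```stata
clear
input mrtst
1
2
.
end

gen married1 = (mrtst==2)              // automatic: the missing case is coded 0
gen married2 = (mrtst==2) if mrtst<.   // the missing case stays missing

list
```

In the third case married1 is 0 while married2 is missing, so the "if mrtst<." qualifier is one way to get the one-line form without turning missing values into zeros.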

Page 39: Introduction to Stata

11. Graphics

The commands that draw graphs are

command          description
------------------------------------------------------------
graph twoway     scatterplots, line plots, etc.
graph matrix     scatterplot matrices
graph bar        bar charts
graph dot        dot charts
graph box        box-and-whisker plots
graph pie        pie charts
other            more commands to draw statistical graphs
------------------------------------------------------------

Page 40: Introduction to Stata

11. Graphics

The commands that save a previously drawn graph, redisplay previously saved graphs, and combine graphs are

command          description
------------------------------------------------------------
graph save       save graph to disk
graph use        redisplay graph stored on disk
graph display    redisplay graph stored in memory
graph combine    combine graphs into one
------------------------------------------------------------

Page 41: Introduction to Stata

11. Graphics

1. Histograms

histogram X // draws a histogram for variable X

histogram X if male==1 in 1/1000 //histogram for variable X for males only in the first 1000 cases

histogram X, percent normal // histogram for variable X on a percentage scale along with the normal curve

For more info on options: help histogram

Page 42: Introduction to Stata

11. Graphics

2. Bar graphs

graph bar X // draws a bar chart with vertical bars for variable X (bar height = mean of X by default)

graph hbar Y // draws a bar chart with horizontal bars for variable Y (bar length = mean of Y by default)

For more info on options: help graph bar

Page 43: Introduction to Stata

11. Graphics

3. Scatterplots

graph twoway scatter Y X

twoway scatter Y X

scatter Y X

The above three commands are equivalent.

Page 44: Introduction to Stata

11. Graphics

graph twoway (scatter y1 x) (scatter y2 x) // draws a scatter plot for variable y1 against x and for y2 against x

This is equivalent to typing

twoway scatter y1 x || scatter y2 x

graph twoway (scatter y x) (lfit y x) // draws a scatter plot for variable y against x and adds the linear prediction fit

graph matrix X1 X2 X3 // scatterplot matrices for the three variables together (two at a time)

For more info on options: help scatter

OR help graph_twoway

Page 45: Introduction to Stata

11. Graphics

4. Line graphs

graph twoway line Y X

twoway line Y X

line Y X

The above three commands are equivalent.

For more info on options: help line

OR help graph_twoway

Page 46: Introduction to Stata

11. Graphics

5. Labeling graph and graph axes

scatter lexp region, title("Scatter plot") subtitle("Life expectancy at birth, US") yvarlabel("life expectancy") xvarlabel("Region")

Note: The whole command should be written on one line.

Page 47: Introduction to Stata

11. Graphics

6. Saving graphs

This command is written directly after the graph that you wish to save, e.g.

scatter yvar xvar

graph save mygraph //save previous graph

This will create the file mygraph.gph

OR

scatter yvar xvar, saving(mygraph)

graph use mygraph //use saved graph

Page 48: Introduction to Stata

11. Graphics

7. Combining graphs

e.g. using lifeexp.dta:

scatter lexp region, saving(figure1, replace)

scatter gnppc region, saving(figure2, replace)

graph combine figure1.gph figure2.gph, saving(byregion)

Page 49: Introduction to Stata

12. Further topics of interest

- It should be noted that data should be weighted to be representative of the population (help weight)

- Stata can merge files (add variables from one file to another) and append files (add cases): (help merge) and (help append)

- Numerous options are present with every command
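As an illustration of merging and appending, here is a small sketch. The file names people.dta and incomes.dta, the variables, and the values are all hypothetical. The merge line uses the Stata 9 syntax, which requires both files to be sorted by the key variable; from Stata 11 onwards the usual form is "merge 1:1 id using incomes".

```stata
* build two small hypothetical files that share the key variable id
clear
input id age
1 30
2 45
end
sort id
save people, replace

clear
input id income
1 1200
2 2500
end
sort id
save incomes, replace

* merge: add the variable income to the cases in people.dta
use people, clear
merge id using incomes      // Stata 9 syntax
tabulate _merge             // _merge==3 marks cases found in both files
drop _merge

* append: add the cases of another file below the current ones
append using people         // variables absent from the appended file are set to missing
list
```

It is worth checking _merge after every merge, because it shows which cases were matched and which came from only one of the two files.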

Page 50: Introduction to Stata

13. Matrices

To input matrix A:

 11    530

550  32130

we do the following:

matrix input A=(11,530\550,32130)

mat list A // to show the matrix content

mat define detA=det(A) // to get the determinant of A

mat define invA=inv(A) // to get the inverse of A

mat define transA=A' // to get the transpose of A

mat D=A+B // to get the sum of A and B (B must be an existing matrix of the same size as A)
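The sum of A and B needs a second matrix, which the slide does not define. In the sketch below B is a hypothetical 2 x 2 identity matrix chosen only for illustration, and the names check and detA are also assumptions. By hand, the determinant of A is 11*32130 - 530*550 = 61930.

```stata
matrix input A = (11,530\550,32130)
matrix input B = (1,0\0,1)    // hypothetical second matrix, same size as A

matrix D = A + B              // element-by-element sum
matrix list D

matrix invA = inv(A)
matrix check = A*invA         // should be (numerically) the identity matrix
matrix list check

scalar detA = det(A)          // det() returns a single number, so a scalar can hold it
display detA
```

Multiplying A by its inverse is a quick way to check that inv() behaved as expected for a given matrix.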