
University Academic Computing Technologies | Introduction to STATA, Data Statistics Analysis

Last Revised: June 2012 Prepared by Youssef Zaghlool

The American University in Cairo University Academic Computing Technologies


What is Stata?

Why use Stata?

Starting up Stata

Datatype (Numeric or string data)

Missing Data

Browsing and Editing Data

Observations

Value Labels

Getting Data into Stata

Commands in Stata

If Condition

Lowercase and Uppercase Letters

Logical and Mathematical Functions

Stata Command Syntax Options

Several Practices

Review Window, and Abbreviating Command Names

Basic Statistics

Frequencies

Correlations

Code Practice

Graphs: Histograms

STATA: Data Analysis and Statistical Software


What is Stata?

Stata, like SPSS (Statistical Package for the Social Sciences), is a general-purpose statistical software package. It is command-based software, available for Windows, Macintosh, and Linux systems. Stata provides a highly flexible interactive mode that makes it easier for beginners to learn and use. Stata also supports features for programming and matrix manipulation.

Stata provides a broad range of analyses, including:

Descriptive Statistics

Regression models

ANOVA (analysis of variance)

Categorical and limited dependent variable models (e.g., logit and probit)

Panel data models

Nonparametric methods

Multivariate methods

Cluster analysis

Survival analysis

Time series analysis

Why use Stata?

1) Intuitive data management capabilities. The creation of variables and sub-setting data is simple and

straightforward.

2) A wide variety of statistical procedures that can be accessed via a point-and-click method.

3) Syntax is provided so users can learn to code quickly.

4) Users can share complex coding syntax with others.

Possible weaknesses

1) Not always easy to handle large datasets.

2) Documentation can be sparse for those who want more detail.

3) Complex programming can become challenging.



Starting up Stata

Stata can be installed on most PCs running Windows XP, Vista or 7. It also runs on Mac OS X and on Linux machines. If you are using Windows, the easiest way to run the software is through Start -> All Programs -> Stata.

When Stata starts up you will see five docked windows, initially arranged as shown below:

1. The window labeled Command is where you type your commands.

2. Stata then shows the results in the larger window immediately above, called, appropriately enough, Results.

3. Your command is added to a list in the window labeled Review on the left, so you can keep track

of the commands you have used.

4. The window labeled Variables, on the top right, lists the variables in your dataset.

5. The Properties window immediately below that, new in version 12, displays properties of your

variables and dataset.



You can resize or even close some of these windows. Stata remembers its settings the next time it runs.

You can save (and then load) named preference sets using the menu Edit|Preferences. You can also choose the font used in each window; just right-click and select font from the context menu.

Numeric or string data

Stata stores data in one of two ways: numeric or string. Numeric variables store numbers (e.g. years, GDP figures), while string variables store text (e.g. country name). Strings can also be used to store numbers, but you will not be able to perform numerical analysis on those numbers. Note that when you refer to a value of a string variable, you must enclose it in straight double quotes. Otherwise, Stata will claim not to be able to find what you are referring to. For example:

summarize if country=="USA"

You can't do any kind of math with a string variable--even if the characters making up the string happen

to be numbers! To Stata, the value 1 and the character "1" are completely different things. For example,

1+1 is 2, but "1" + "1" is "11". Note the quotation marks: whenever you talk about string variables the

values need to go in quotation marks.

It's a common mistake when importing data to accidentally make Stata think a numeric variable is a

string. The values of string variables are red in the data browser (like make in this data set) so if you start

seeing red where you shouldn't you know you've got a problem.
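If a variable that should be numeric has been read in as a string, it can usually be converted with Stata's destring command. The following is a minimal sketch: the variable names income_str and income and the three values are made up for illustration and do not come from the handout's data.

```stata
* Hypothetical data: income_str is a string variable that happens to hold digits
clear
input str6 income_str
"4200"
"3900"
"5100"
end

* The storage type is reported as a string type, so math is not possible yet
describe income_str

* Create a numeric copy called income, leaving the original string untouched
destring income_str, generate(income)

* Summary statistics now work on the numeric copy
summarize income
```

Using generate() rather than replace keeps the original string variable, which makes it easy to compare the two in the data browser.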

There are several numeric storage types; which one to use depends on the range and precision of the values you will be storing:

1) Byte: from -127 to 100

2) Int: from -32,767 to 32,740

3) Long: from -2,147,483,647 to 2,147,483,620

4) Float: from -1.70141173319*10^38 to 1.70141173319*10^38

5) Double: from -8.9*10^307 to 8.9*10^307

The default type is “float”; use “double” when you need more digits of precision.
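As a brief sketch of how a storage type is chosen in practice, the type name goes between generate and the new variable name. This example assumes the auto example dataset that ships with Stata (loaded with sysuse), which appears to be the data used elsewhere in this handout; the new variable names are made up.

```stata
sysuse auto, clear

* A 0/1 indicator fits comfortably in a byte
generate byte cheap = price < 5000

* Squared prices are large, so store them as double
generate double price_sq = price^2

* Check the storage type of each new variable
describe cheap price_sq

* compress asks Stata to shrink each variable to the smallest type
* that holds its values without loss
compress
```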

Missing Data

Be aware of missing data in Stata. Missing data can result when you compute a number whose answer is not defined. Missing data can also result during data collection; for example, in data on publicly listed companies, R&D expenditure figures are often unavailable. Missing data can be entered in Stata by using a period instead of a number. Missing values can also be referred to in Stata expressions. For example, you can check whether salary income is missing, and list only the observations where this is true:

list if salaryincome==.

This lists only the observations in which salary income is missing.

Missing values for string variables are denoted by "", the empty string; not to be confused with a string

that is all blanks, such as " ".
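The sketch below shows a few ways to find and guard against missing values. It assumes the auto example dataset that ships with Stata, in which rep78 has some missing values. One point worth knowing: Stata treats a numeric missing value as larger than any number, so a "greater than" condition will include missing values unless you exclude them.

```stata
sysuse auto, clear

* Count the observations where rep78 is missing
count if rep78 == .

* List them; missing() is true for any kind of missing value
list make rep78 if missing(rep78)

* Without the second condition, cars with missing rep78 would be listed too,
* because missing counts as larger than 3
list make rep78 if rep78 > 3 & rep78 < .
```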



Browsing and Editing Data

If you want to look at the data in a spreadsheet-like format but not change them, it is bad practice to use Stata’s data editor, as you could accidentally change the data! Instead, open the data browser with the button at the top or with the browse command, or list the data in the Results window with the list command.



browse:- Opens the data viewer, to look at data without changing them. Close the viewer before using

other commands.

list:- Lists data. If there’s more than 1 screenful, press space for the next screen, or q to quit listing.
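Both commands accept a variable list and a range of observations, which keeps the output manageable. A short sketch, assuming the auto example dataset that ships with Stata:

```stata
sysuse auto, clear

* Open the read-only browser showing three variables only
browse make price mpg

* Print the first five observations in the Results window
list make price mpg in 1/5
```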

Observations

A Stata data set consists of observations (rows), variables (columns) and values (cells). While all the

observations in a given data set should represent more or less the same thing, the meaning of

"observation" can vary widely between data sets and it's important to keep track of what it means in

yours.

The color scheme used in the data browser is important:

Red: String

Black: Numeric

Blue: Value Labels



Value Labels

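In this data set, the variable foreign appears to be a string with the values Domestic and Foreign. But note that it's in blue rather than red (the color for strings), and that at the top of the browser window the value of foreign for the first observation is listed as 0. This is because foreign is a numeric variable with value labels attached: the browser displays the label text in place of the underlying number.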
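To see both sides of a labeled variable, and to attach labels of your own, something like the following can be used. This is a sketch assuming the auto example dataset that ships with Stata; the variable highmpg and the label name yesno are made up for illustration.

```stata
sysuse auto, clear

* With labels (Domestic, Foreign), then with the underlying numbers (0, 1)
tabulate foreign
tabulate foreign, nolabel

* Show every value label defined in the data set and its mapping
label list

* Build a 0/1 variable, define a value label, and attach it
generate byte highmpg = mpg > 25 if mpg < .
label define yesno 0 "No" 1 "Yes"
label values highmpg yesno
tabulate highmpg
```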

Getting Data into Stata

If you need to type in data by hand, you can do so in the data editor. However, you should define the

variables first so you can choose their types yourself. If you just start typing in the data editor Stata will

try to guess, but it will sometimes make mistakes like thinking a numeric variable should be a string.

To create a new variable, click Data, Create or change data, Create new variable. Type the name you

want to give the new variable in the Variable name box. In Contents of the variable choose Fill with

missing data--you'll type in the real values later.



For most numeric variables the default Variable type, float, will be fine.

To create a text variable, change the Variable type to str (string).



You can also create the exact same variables by typing the generate command (abbreviated gen). This

could be substantially faster than the menus if you have many variables to create. The syntax is simply

gen, then the variable type if it's anything but float, then the variable name, and finally what to set it

equal to.

For numeric variables, "missing" is denoted by a period (.). Thus the command to create a float variable

called x and set it to missing is:

gen x=.

"Missing" for a string is a string with nothing in it, or "" (an opening quote immediately followed by a

closing quote). Thus the command to make a string (str) variable called y and set it to missing is:

gen str y=""
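For small data sets, typing the values with the input command is an alternative to the data editor, since the variable types are declared up front. The sketch below uses made-up employee data; the names and numbers are purely illustrative.

```stata
* Made-up data for illustration; str10 and str1 declare string variables
clear
input str10 employee str1 gender age salary
"Amira" "f" 31 5200
"Omar"  "m" 45 3900
"Karim" "m" 28 4600
end

* Add an empty numeric variable to be filled in later
generate bonus = .

* Check the result
describe
list
```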

Commands in Stata

The list command simply prints the content of your data set. To get the simplest list, click Data, Describe

data, List Data and click OK in the resulting window without changing anything. Alternatively, you can

type list or even just l (which Stata understands as an abbreviation for list).

This gives you a great deal of information, probably more than you want. Thus it's useful to limit the

command so it only lists what you want to see. This is typical in Stata: once you've chosen what you

want to do (i.e. picked a command) you then need to tell Stata what you want it to act on.

If Condition

If you want the command to only act on certain observations, you need to tell Stata which observations.

This takes the form of an If condition, and the command will only act on those observations where the

condition is true.

Click on the by/if/in tab, and in the If: box type gender=="m". Click Submit and you'll get a listing of the names, ages and salaries of just the male employees.



The reason you had to type two equals signs (==) between gender and "m" is that Stata uses the equals sign for two different purposes. One equals sign is used for assignment when creating or changing variables: gender = "m" means "set gender to m." Two equals signs are used in tests and conditions. In that form it's a question: "Is gender equal to m?"

Note that in the command the list of variables comes before the if condition:

list employee if gender == "m"

list employee age salary if gender == "m" & salary > 4000

The second example displays all the male employees with a monthly salary of more than 4000.

Lowercase and Uppercase Letters

Case matters: if you use an uppercase letter where a lowercase letter belongs, or vice versa, an error

message will display.
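A quick way to see this for yourself, assuming the auto example dataset that ships with Stata, in which the variable is stored as lowercase mpg:

```stata
sysuse auto, clear

* Works: the name matches the variable exactly
summarize mpg

* Fails with a "variable MPG not found" error; capture noisily displays
* the error but lets a do-file carry on to the next command
capture noisily summarize MPG
```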



Logical and Mathematical Functions

When creating expressions (for example, to use in generating a variable), it is necessary to know Stata’s syntax for the logical, relational and arithmetic operators. Here is a list:

~ not

| or

& and

== equals

+ plus

- minus

* multiplied by

/ divided by

^ raised to

> greater than

>= greater than or equal to

< less than

<= less than or equal to

~= not equal to
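The operators above can be combined in a single expression, with parentheses controlling how the conditions are grouped. A short sketch, assuming the auto example dataset that ships with Stata; the new variable names are made up.

```stata
sysuse auto, clear

* Cars that are either very economical or very cheap, and not foreign
list make price mpg if (mpg >= 30 | price < 4000) & foreign ~= 1

* Arithmetic operators inside generate
generate price_per_lb = price / weight
generate weight_sq = weight^2
summarize price_per_lb weight_sq
```

Without the parentheses, & is evaluated before |, so the condition would mean something different.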

Edit: Opens the data editor, to type in or paste data. You must close the data editor before you can run

any further commands.

Keep and Drop

The original dataset may contain variables you are not interested in or observations you don’t want to

analyze. It’s a good idea to get rid of these first – that way, they won’t use up valuable memory and

these data won’t inadvertently sneak into your analysis. You can tell Stata to either keep what you want

or drop what you don’t want – the end results will be the same.

keep ID (keeps only ID and drops all the other variables)

or

drop ID (drops only ID and keeps all the other variables)
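Both commands also accept an if condition, in which case they act on observations rather than variables. A sketch, assuming the auto example dataset that ships with Stata:

```stata
sysuse auto, clear

* Keep three variables and discard the rest
keep make price mpg

* Drop observations (rows) that meet a condition
drop if price > 10000

* keep and drop change only the copy in memory, so reloading restores everything
sysuse auto, clear
keep if foreign == 1
```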



Stata Command Syntax Options

A Stata command can be followed by:

(1) a list of variables

(2) an if-statement

(3) options

A list of variables consists of the names of the variables, separated with spaces. It goes immediately

after the command. If you leave the list blank, Stata assumes that you mean all variables.

Examples:

edit var1 var2 var3: Opens the data editor, just with variables var1, var2, and var3.

edit: Opens the data editor, with all variables.

Several Practices

edit var1 if var2 > 3: Opens the data editor, just with variable var1, only for observations in which var2 is greater than 3.

edit if var2 == var3: Opens the data editor, with all variables, only for observations in which var2 equals var3.

edit var1 in 10: Opens the data editor, just with var1, just in the 10th observation.

edit var1 in 101/200: Opens the data editor, just with var1, in observations 101-200.

edit var1 if var2 > 3 in 101/200: Opens the data editor, just with var1, in the subset of observations 101-200 that meet the requirement var2 > 3.

Options alter what the command does. There are many options, depending on the command – get help

on the command to see a list of options. Options go after any variable list and if-statements, and must

be preceded by a comma. Do not use an additional comma for additional options (the comma works like

a toggle switch, so a second comma turns off the use of options!). Examples:

use "filename.dta", clear

Reads in a Stata-format data file, clearing all data previously in memory! (Without the clear option, Stata refuses to let you load new data if you haven’t saved the old data. With it, the old data are forgotten and will be gone forever unless you saved some version of them.)

save "filename.dta", replace

Saves the data, replacing a previously existing file if any.

You will see more examples of options below.
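As a sketch of the one-comma rule, the example below puts two options after a single comma and then saves and reloads a copy of the data. It assumes the auto example dataset that ships with Stata; mycopy.dta is a made-up file name that will be written to the current working directory.

```stata
sysuse auto, clear

* One comma, then the options separated by spaces
list make price in 1/5, clean noobs

* Save a copy, overwriting any earlier file of the same name
save "mycopy.dta", replace

* Load it again, discarding whatever is in memory
use "mycopy.dta", clear
```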



Review Window, and Abbreviating Command Names

The Review window lists commands you typed previously. Click in the Review window to put a previous

command in the Command window (then you can edit it as desired). Double-click to run a command.

Another shortcut is that many command names can be abbreviated. For example, instead of typing “summarize”, “su” will do, and instead of “regress”, “reg” will do.



Basic Statistics

Stata has a large number of commands dedicated to basic statistics; we'll discuss some of the most

commonly used. Feel free to skip any you don't need.

The basic summarize command gives you the number of observations, mean, standard deviation, minimum and maximum of each variable. To use it, click Statistics, then Summaries, tables, and tests, then Summary and descriptive statistics and finally Summary statistics. Select or type mpg in Variables, then click Submit.

Alternatively, you could have just typed:

sum mpg

and gotten the exact same thing (sum being an abbreviation for summarize).

Missing values are ignored when calculating summary statistics. If you type:

sum rep78



you'll see that the number of observations is 69 rather than 74 like it was for mpg. Five observations

have missing values for rep78 and could not be included in the calculations, so the mean was calculated

over the 69 observations that do have valid values.

Variable lists, if conditions, by groups and options work for summarize just like they did for list in the

examples in part one. You've already seen a variable list in action (getting summary statistics for mpg or

rep78 rather than all variables). Next, find the mean of mpg for cars weighing over 4,000 pounds by

clicking by/if/in and typing weight>4000 in the If box.

The command for this is: sum mpg if weight>4000
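A few variations on summarize are sketched below, assuming the auto example dataset that ships with Stata. The detail option and the bysort prefix are standard Stata features, shown here only as illustrations of options and by groups.

```stata
sysuse auto, clear

* Extra detail: percentiles, variance, skewness and kurtosis
summarize mpg, detail

* Separate summaries for each value of foreign
bysort foreign: summarize mpg

* A variable list combined with an if condition
summarize mpg price if weight > 4000
```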



Frequencies

The tabulate command is used to create frequency tables. It has two variants: one for one-way tables and one for two-way tables. If you type tab, Stata will figure out which version you want by looking at how many variables you list afterwards. But if you're using menus you'll click Statistics, Summaries, tables, and tests, Tables and then either One-way tables or Two-way tables with measures of association.

One-way Tables

A one-way table simply lists the values of a variable and how many times each value appears in your

data set. To create a one-way table, click Statistics, Summaries, tables, and tests, Tables and then One-

way tables. Select or type rep78 as the Categorical variable and click Submit.



The resulting command is:

tab rep78

Note that the missing values of rep78 were not included in the table. If you want to see how many

missing values you have, you should check Treat missing values like other values. Then they'll get their

own entry.

Two-way Tables

Two-way tables tell you how many times each combination of two variables appears in your data. To

create a two-way table, click Statistics, Summaries, tables, and tests, Tables and then Two-way tables

with measures of association. Select or type rep78 for the Row variable and foreign for the Column

variable, then click Submit.

Note that missing values do not appear in the table unless you check Treat missing values like other

values, just like with one-way tables.
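The menu choices described in this section correspond to the commands sketched below, assuming the auto example dataset that ships with Stata. The missing option is the command-line equivalent of checking Treat missing values like other values.

```stata
sysuse auto, clear

* One-way table including a row for missing values
tabulate rep78, missing

* Two-way table: rep78 in the rows, foreign in the columns
tabulate rep78 foreign

* Same table with missing values shown and row percentages added
tabulate rep78 foreign, missing row
```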



Correlations

To calculate the correlation between variables, click Statistics, then Summaries, tables, and tests, then Summary and descriptive statistics and finally Correlations and covariances. Then type the names of the variables you want the correlations for in the Variables box. This data set has several variables relating to the size of the cars: weight, length and displacement (a measure of the size of the engine). We would expect them to be highly correlated, but type all three in the Variables box and click Submit to verify that hypothesis.

The command is:

correlate weight length displacement



Hypothesis Tests

To use a t-test to test the hypothesis that the mean of a variable is equal to some number, click

Statistics, then Summaries, tables, and tests, then Classical tests of hypotheses and finally One-sample

mean-comparison test. Select or type mpg as the Variable name and then type 20 in Hypothesized

mean. Click OK and Stata will test the hypothesis that the mean of mpg is 20.

The command is:

ttest mpg==20
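For comparison, the sketch below runs the one-sample test from the text and then a two-sample version that compares the mean of mpg between two groups. It assumes the auto example dataset that ships with Stata; the two-sample test is not covered in the text above and is included only as a closely related illustration.

```stata
sysuse auto, clear

* One-sample test: is the mean of mpg equal to 20?
ttest mpg == 20

* Two-sample test: do domestic and foreign cars have the same mean mpg?
ttest mpg, by(foreign)
```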



Code: Mean, Variance, Number of Non-missing Observations, Minimum, Maximum, Etc.

summarize varlist: See summary information for the variables listed.

tabulate varname: Creates a table listing the number of observations having each different value of the

variable varname.

tabulate var1 var2: Creates a two-way table listing the number of observations in each row and column.

tabulate var1 var2, exact: Creates the same two-way table, and carries out a statistical test of the null

hypothesis that var1 and var2 are independent. The test is exact, in that it does not rely on convergence

to a distribution.

tabulate var1 var2, chi2: Same as above, except the test is Pearson's chi-squared test, which relies on an asymptotic approximation rather than an exact calculation. If you have lots of observations, exact tests can take a long time and can run out of available computer memory; if so, use this test instead.

histogram varname: Plots a histogram of the specified variable.

histogram varname, bin(#) normal: The bin(#) option specifies the number of bars. The normal option

overlays a normal probability distribution with the same mean and variance.

kdensity varname, normal: Creates a “kernel density plot”, which is an estimate of the pdf that generated the data. The “normal” option overlays a normal density for comparison.

scatter yvar xvar: Plots data, with yvar on the vertical axis and xvar on the horizontal axis.

Correlations and Covariances

The following commands compute the correlations and covariances between any list of variables. Note

that if any of the variables listed have missing values in some rows, those rows are ignored in all

calculations.

correlate var1 var2 …: Computes the sample correlations between variables.

correlate var1 var2 …, covariance: Computes the sample covariances between variables.

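Sometimes you have missing values in some rows, but want to use all available data wherever possible, i.e., to use a row for those correlations where both variables are present even if it cannot be used for others. Stata's pwcorr command computes such pairwise correlations.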
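The difference between the two approaches can be seen by running correlate and pwcorr on the same variables. This is a sketch assuming the auto example dataset that ships with Stata, in which rep78 has some missing values.

```stata
sysuse auto, clear

* correlate drops every row that has a missing value in any listed variable
correlate price mpg rep78

* pwcorr uses all rows available for each pair of variables;
* the obs option prints how many observations each correlation is based on
pwcorr price mpg rep78, obs
```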



Generating and Changing Variables

A variable in Stata is a whole column of data. You can generate a new column of data using a formula,

and you can replace existing values with new ones. Each time you do this, the calculation is done

separately for every observation in the sample, using the same formula each time.

Generating Variables

generate newvar = …

Generates a new variable using the formula you enter in place of “…”. Examples follow.

gen f = m * a

Remember, Stata allows abbreviations: “gen” means “generate”.

gen xsquared = x^2

Replacing Values of Variables

replace agesquared = age^2

Changes the value of the variable agesquared to equal age squared. This would be useful if you had made a mistake when you first created the variable.

replace young = age < 16 if age<.

Changes the value of the variable young to equal 1 if age is less than 16, and 0 otherwise. The “if age<.” ensures that replacements are only made when values of age are not missing.

replace young = 0 if age>=16 & age<18

Changes the value of the variable young to 0, but only if age is at least 16 and less than 18. That is, no change is made if age is less than 16 or if age is at least 18.
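The same pattern can be tried on real data. The sketch below assumes the auto example dataset that ships with Stata; the variables mpg_sq and highrep and the cut-offs are made up for illustration.

```stata
sysuse auto, clear

* A new column computed row by row from an existing one
generate mpg_sq = mpg^2

* 1 if rep78 is 4 or 5, 0 if lower, and missing where rep78 is missing
generate highrep = rep78 >= 4 if rep78 < .

* replace only works on a variable that already exists;
* here the cut-off is tightened so that only a rating of 5 counts as high
replace highrep = 0 if rep78 == 4

* Check the result, including the missing values
tabulate highrep, missing
```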

If-then-else Formulas

gen val = cond(a, b, c)

Stata’s cond(if, then, else) works much like Excel’s IF(if, then, else). With the statement cond(a,b,c), Stata checks whether a is true and then returns b if a is true or c if a is not true.

gen realwage = cond(year==1992, wage*(188.9/140.3), wage)

Creates a variable that uses one formula for observations in which the year is 1992, or a different formula if the year is not 1992.
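A cond() formula can always be written as a generate followed by a replace, which is a useful way to check that it does what you expect. The sketch below assumes the auto example dataset that ships with Stata; the 10% markup and the variable names are arbitrary illustrations.

```stata
sysuse auto, clear

* One step: marked-up price for foreign cars, plain price otherwise
generate price_adj = cond(foreign == 1, price * 1.10, price)

* Two steps: start from the plain price, then overwrite the foreign cars
generate price_adj2 = price
replace price_adj2 = price * 1.10 if foreign == 1

* assert stops with an error if the two versions differ in any observation
assert price_adj == price_adj2
```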



Graphs: Histograms

To make a histogram, click Graphics, Histogram, then select or type mpg as the Variable. The mpg variable is continuous (in theory anyway), so leave Data are continuous selected. Under Y axis select Frequency to have the bar heights labeled in terms of the number of observations in each bin. Click Submit to see the results.

The command is:

hist mpg, freq

You can choose either the number of bins or the width of each bin (one implies the other). Check the

box by Number of bins, type in 20 and click Submit again to see the difference it makes.

If you tell Stata that your variable is discrete, the resulting histogram will have one bin for each unique value of the variable. Change Variable to rep78, choose Data are discrete and click Submit. Note that the checkboxes under Bins are grayed out.
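The menu choices above correspond to options of the histogram command. A sketch, assuming the auto example dataset that ships with Stata:

```stata
sysuse auto, clear

* Continuous variable with 20 bins and frequencies on the Y axis
histogram mpg, frequency bin(20)

* Alternatively, set the width of each bin instead of the number of bins
histogram mpg, frequency width(5)

* Discrete variable: one bar for each distinct value
histogram rep78, discrete frequency
```

Each histogram command replaces the previous graph in the Graph window, so run them one at a time to compare the results.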
