University Academic Computing Technologies | Introduction to STATA, Data Statistics Analysis 1
Last Revised: June 2012 Prepared by Youssef Zaghlool
The American University in Cairo University Academic Computing Technologies
What is Stata?
Why use Stata?
Starting up Stata
Datatype (Numeric or string data)
Missing Data
Browsing and Editing Data
Observations
Value Labels
Getting Data into Stata
Commands in Stata
If Condition
Lowercase and Uppercase Letters
Logical and Mathematical Functions
Stata Command Syntax Options
Several Practices
Review Window, and Abbreviating Command Names
Basic Statistics
Frequencies
Correlations
Code Practice
Graphs: Histograms
STATA: Data Analysis and Statistical Software
What is Stata?
Stata, like SPSS (Statistical Package for the Social Sciences), is a general-purpose statistical software package.
It is command-based software, and is available for Windows, Macintosh, and Linux systems. Stata
provides a highly flexible interactive mode that makes it easier for beginners to learn and use. Stata also
supports features for programming and matrix manipulation.
Stata provides a broad range of analyses, including:
Descriptive Statistics
Regression models
ANOVA (analysis of variance)
Categorical and limited dependent models (e.g., logit and probit)
Panel data models
Nonparametric methods
Multivariate methods
Cluster analysis
Survival analysis
Time series analysis
Why use Stata?
1) Intuitive data management capabilities. The creation of variables and sub-setting data is simple and
straightforward.
2) Wide variety of statistical procedures that can be accessed via a point-and-click method.
3) Syntax is provided so users can learn to code quickly.
4) Users can share complex coding syntax with others.
Possible weaknesses
1) Not always easy to handle large datasets.
2) Documentation can be sparse for those who want more detail.
3) Complex programming can become challenging.
Starting up Stata
Stata can be installed on most PCs running Windows XP, Vista or 7. It is also compatible with Mac OS X
and Linux. If you are using Windows, the easiest way to run the software is through
Start -> All Programs -> Stata.
When Stata starts up you will see five docked windows, initially arranged as shown below:
1. The window labeled Command is where you type your commands.
2. Stata then shows the results in the larger window immediately above, called appropriately
enough Results.
3. Your command is added to a list in the window labeled Review on the left, so you can keep track
of the commands you have used.
4. The window labeled Variables, on the top right, lists the variables in your dataset.
5. The Properties window immediately below that, new in version 12, displays properties of your
variables and dataset.
You can resize or even close some of these windows. Stata remembers its settings the next time it runs.
You can save (and then load) named preference sets using the menu Edit|Preferences. You can also
choose the font used in each window; just right-click and select Font from the context menu.
Numeric or string data
Stata stores or formats data in either of two ways – numeric or string. Numeric will store numbers (e.g.
years, GDP figures) while string will store text (e.g. country name). Strings can also be used to store
numbers, but you will not be able to perform numerical analysis on those numbers. Note, with string
variables, you must enclose the observation reference in double quotes. Otherwise, Stata will claim not
to be able to find what you are referring to. For example:
summarize if country=="USA"
You can't do any kind of math with a string variable--even if the characters making up the string happen
to be numbers! To Stata, the value 1 and the character "1" are completely different things. For example,
1+1 is 2, but "1" + "1" is "11". Note the quotation marks: whenever you talk about string variables the
values need to go in quotation marks.
It's a common mistake when importing data to accidentally make Stata think a numeric variable is a
string. The values of string variables are red in the data browser (like make in this data set) so if you start
seeing red where you shouldn't you know you've got a problem.
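The difference between numbers and strings can be checked directly. The sketch below is illustrative only: it assumes the auto example data that ships with Stata (loaded with sysuse), and the variable names mpg_text and mpg_num are made up for the demonstration.

```stata
* Numbers add; strings concatenate
display 1 + 1
display "1" + "1"

* Make a string copy of a numeric variable, then convert it back
sysuse auto, clear
tostring mpg, generate(mpg_text)
describe mpg mpg_text
destring mpg_text, generate(mpg_num)

* mpg and mpg_num can be summarized; mpg_text is a string and cannot
summarize mpg mpg_num
```

If an import turns a numeric column into a string, destring is one way to recover it, provided every value really is a number.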
There are several numeric storage types; which one you need depends on the range and number of digits
you will be using:
1) Byte: from -127 to 100
2) Int: from -32,767 to 32,740
3) Long: from -2,147,483,647 to 2,147,483,620
4) Float: from -1.70141173319*10^38 to 1.70141173319*10^38
5) Double: from -8.9*10^307 to 8.9*10^307
The default type is float; use long or double if you are going to be dealing with more digits.
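As a rough illustration of choosing a storage type, the sketch below creates one small and one high-precision variable. It assumes the auto example data shipped with Stata; the variable names heavy and price_thousands are hypothetical.

```stata
sysuse auto, clear

* byte is enough for a 0/1 indicator
generate byte heavy = weight > 3000

* double keeps more digits than the default float
generate double price_thousands = price / 1000

* describe shows the storage type of each variable
describe heavy price_thousands

* compress moves variables to smaller types where no information is lost
compress
```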
Missing Data
Be aware of missing data in Stata. Missing data can result when you compute a number whose answer is
not defined. Missing data can also result during data collection; for example, R&D expenditure data are
often unavailable for publicly listed companies. Missing data can be entered in Stata by using a
period instead of a number. Missing values can also be used in Stata expressions. For example, you can
check whether salary income is missing, and list only the observations where this is true:
list if salaryincome==.
This lists only the observations in which salaryincome is missing.
Missing values for string variables are denoted by "", the empty string; not to be confused with a string
that is all blanks, such as " ".
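One consequence worth seeing in practice is that Stata treats a numeric missing value as larger than any number, so a "greater than" condition also picks up missing observations. The sketch below assumes the auto example data shipped with Stata, in which rep78 has some missing values.

```stata
sysuse auto, clear

* Two equivalent ways to count missing values of rep78
count if rep78 == .
count if missing(rep78)
list make rep78 if missing(rep78)

* This count includes the observations where rep78 is missing
count if rep78 > 4

* Exclude missing values explicitly to count only real values above 4
count if rep78 > 4 & !missing(rep78)
```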
Browsing and Editing Data
If you want to look at the data in a spreadsheet-like format but not change them, it is bad practice to
use Stata’s data editor, as you could accidentally change the data! Instead, use the browser via the
button at the top or the browse command, or list the data in the Results window with the list command.
browse: Opens the data viewer, to look at data without changing them. Close the viewer before using
other commands.
list: Lists data. If there is more than one screenful, press space for the next screen, or q to quit listing.
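Both commands accept a variable list, which keeps the display manageable. A minimal sketch, assuming the auto example data shipped with Stata:

```stata
sysuse auto, clear

* Open the read-only viewer with three variables only
browse make price mpg

* Print the same three variables for the first five observations
list make price mpg in 1/5
```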
Observations
A Stata data set consists of observations (rows), variables (columns) and values (cells). While all the
observations in a given data set should represent more or less the same thing, the meaning of
"observation" can vary widely between data sets and it's important to keep track of what it means in
yours.
The color scheme in the data browser is very important, because it tells you the type of each variable:
Red: String
Black: Numeric
Blue: Value Labels
Value Labels
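In this data set, the variable foreign appears to be a string with the values Domestic and Foreign. But
note that it's in blue rather than red (for String), and that at the top of the browser window it lists the
value of foreign for the first observation as 0. That is because foreign is a numeric variable with a value
label attached: the stored number 0 is displayed as Domestic.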
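The underlying numbers and their labels can be inspected from the command line. This sketch assumes the auto example data shipped with Stata, where the value label attached to foreign is named origin; describe reports the label name if your copy differs.

```stata
sysuse auto, clear

* The "value label" column shows which label is attached to foreign
describe foreign

* Show the mapping from numbers to text
label list origin

* The same table with labels, then with the stored numbers
tabulate foreign
tabulate foreign, nolabel
list make foreign in 1/3, nolabel
```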
Getting Data into Stata
If you need to type in data by hand, you can do so in the data editor. However, you should define the
variables first so you can choose their types yourself. If you just start typing in the data editor Stata will
try to guess, but it will sometimes make mistakes like thinking a numeric variable should be a string.
To create a new variable, click Data, Create or change data, Create new variable. Type the name you
want to give the new variable in the Variable name box. In Contents of the variable choose Fill with
missing data--you'll type in the real values later.
For most numeric variables the default Variable type, float, will be fine.
To create a text variable, change the Variable type to str (string).
You can also create the exact same variables by typing the generate command (abbreviated gen). This
could be substantially faster than the menus if you have many variables to create. The syntax is simply
gen, then the variable type if it's anything but float, then the variable name, and finally what to set it
equal to.
For numeric variables, "missing" is denoted by a period (.). Thus the command to create a float variable
called x and set it to missing is:
gen x=.
"Missing" for a string is a string with nothing in it, or "" (an opening quote immediately followed by a
closing quote). Thus the command to make a string (str) variable called y and set it to missing is:
gen str y=""
Commands in Stata
The list command simply prints the content of your data set. To get the simplest list, click Data, Describe
data, List Data and click OK in the resulting window without changing anything. Alternatively, you can
type list or even just l (which Stata understands as an abbreviation for list).
This gives you a great deal of information, probably more than you want. Thus it's useful to limit the
command so it only lists what you want to see. This is typical in Stata: once you've chosen what you
want to do (i.e. picked a command) you then need to tell Stata what you want it to act on.
If Condition
If you want the command to only act on certain observations, you need to tell Stata which observations.
This takes the form of an If condition, and the command will only act on those observations where the
condition is true.
Click on the by/if/in tab, and in the If: box type gender=="m". Click Submit and you'll get a listing of
the names, ages and salaries of just the male employees.
The reason you had to type two equals signs (==) between gender and "m" is that Stata uses the equals
sign for two different purposes. One equals sign is used for assignment when creating or changing
variables: gender="m" means "make gender m." Two equals signs are used in tests and conditions. In
that form it's a question: "Is gender equal to m?"
Note that in the command the list of variables comes before the if condition:
list employee if gender == "m"
list employee age salary if gender == "m" & salary > 4000
The second example displays all the male employees with a monthly salary above 4000.
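To try if conditions without the original employee file, the sketch below types in a tiny made-up data set. The names and figures are invented for illustration, and the variable names employee, gender, age and salary are assumptions based on the description above.

```stata
clear
input str10 employee str1 gender age salary
"Ahmed" "m" 34 5200
"Mona"  "f" 29 4800
"Karim" "m" 41 3900
end

* Only the male employees
list employee if gender == "m"

* Male employees earning more than 4000
list employee age salary if gender == "m" & salary > 4000
```

Typing a single equals sign in either condition makes Stata stop with an error instead of running the command.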
Lowercase and Uppercase Letters
Case matters: if you use an uppercase letter where a lowercase letter belongs, or vice versa, an error
message will display.
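A quick way to see this is to run the same command with different capitalization. The sketch assumes the auto example data shipped with Stata; capture noisily is used only so that the errors are shown without stopping a do-file.

```stata
sysuse auto, clear

* Works: the variable is named mpg
summarize mpg

* Fails: there is no variable named MPG
capture noisily summarize MPG
display _rc

* Fails: command names are case sensitive too
capture noisily Summarize mpg
display _rc
```

A nonzero value of _rc means the preceding command ended with an error.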
Logical and Mathematical Functions
When creating expressions (for example, to use in generating a variable), it is necessary to know Stata’s
syntax for various functions. Here is an extensive list:
~ not
| or
& and
== equals
+ plus
- minus
* multiplied by
/ divided by
^ raised to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
~= not equal to
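The operators can be tried with display for arithmetic and with count for logical conditions. A small sketch, assuming the auto example data shipped with Stata:

```stata
* Arithmetic: multiplication is done before addition
display 2 + 3 * 4
display 2^10
display 7 / 2

* A true comparison evaluates to 1, a false one to 0
display 5 >= 3

sysuse auto, clear

* or, and, not
count if foreign == 1 | price > 10000
count if foreign == 1 & price > 10000
count if ~(foreign == 1)

* not equal to
count if rep78 ~= 3
```

Note that the last count also includes observations where rep78 is missing, since a missing value is not equal to 3.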
edit: Opens the data editor, to type in or paste data. You must close the data editor before you can run
any further commands.
Keep and Drop
The original dataset may contain variables you are not interested in or observations you don’t want to
analyze. It’s a good idea to get rid of these first – that way, they won’t use up valuable memory and
these data won’t inadvertently sneak into your analysis. You can tell Stata to either keep what you want
or drop what you don’t want – the end results will be the same.
keep ID (Keeps only ID and drops all the other variables)
or
drop ID (Drops only ID and keeps all the other variables)
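keep and drop work on observations as well as variables when combined with an if condition. A minimal sketch, assuming the auto example data shipped with Stata (these commands change only the copy in memory, not the file on disk):

```stata
sysuse auto, clear

* Drop two variables by name
drop headroom trunk

* Keep only the observations for domestic cars
keep if foreign == 0

* Keep three variables and drop everything else
keep make price mpg

describe
```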
Stata Command Syntax Options
A command name can be followed by:
(1) a list of variables
(2) an if-statement
(3) options
A list of variables consists of the names of the variables, separated with spaces. It goes immediately
after the command. If you leave the list blank, Stata assumes that you mean all variables.
Examples:
edit var1 var2 var3: Opens the data editor, just with variables var1, var2, and var3.
edit: Opens the data editor, with all variables.
Several Practices
edit var1 if var2 > 3: Opens the data editor, just with variable var1, only for observations in which var2 is
greater than 3.
edit if var2 == var3: Opens the data editor, with all variables, only for observations in which var2 equals
var3.
edit var1 in 10: Opens the data editor, just with var1, just in the 10th observation.
edit var1 in 101/200: Opens the data editor, just with var1, in observations 101-200.
edit var1 if var2 > 3 in 101/200: Opens the data editor, just with var1, in the subset of observations
101-200 that meet the requirement var2 > 3.
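The same variable list, if and in pieces work with other commands. The sketch below uses list rather than edit so nothing can be changed by accident; it assumes the auto example data shipped with Stata.

```stata
sysuse auto, clear

* One variable, 10th observation only
list make in 10

* Two variables, observations 1 through 5
list make price in 1/5

* All observations that satisfy a condition
list make price mpg if mpg > 30

* Condition and range together
list make mpg if mpg > 20 in 1/20
```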
Options alter what the command does. There are many options, depending on the command – get help
on the command to see a list of options. Options go after any variable list and if-statements, and must
be preceded by a comma. Do not use an additional comma for additional options (the comma works like
a toggle switch, so a second comma turns off the use of options!). Examples:
use "filename.dta", clear: Reads in a Stata-format data file, clearing all data previously in memory!
(Without the clear option, Stata refuses to let you load new data if you haven’t saved the old data. Here
the old data are forgotten and will be gone forever unless you saved some version of them.)
save "filename.dta", replace: Saves the data, replacing a previously-existing file if any.
You will see more examples of options below.
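The sketch below shows one comma followed by two options, and a save and reload round trip. It assumes the auto example data shipped with Stata; the file name mycars.dta is hypothetical and will be written to the current working directory.

```stata
sysuse auto, clear

* Two options after a single comma: clean output, no observation numbers
list make price in 1/5, clean noobs

* Save a copy, overwriting any existing file with that name
save "mycars.dta", replace

* Load it again, discarding whatever is in memory
use "mycars.dta", clear
```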
Review Window, and Abbreviating Command Names
The Review window lists commands you typed previously. Click in the Review window to put a previous
command in the Command window (then you can edit it as desired). Double-click to run a command.
Another shortcut is that many commands can have their names abbreviated. For example, instead
of typing “summarize”, “su” will do, and instead of “regress”, “reg” will do.
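Each pair of commands below should give identical output; this is only a quick check of the abbreviations named above, assuming the auto example data shipped with Stata.

```stata
sysuse auto, clear

* Full name and abbreviation of summarize
summarize mpg
su mpg

* Full name and abbreviation of regress
regress price mpg
reg price mpg
```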
Basic Statistics
Stata has a large number of commands dedicated to basic statistics; we'll discuss some of the most
commonly used. Feel free to skip any you don't need.
The basic summarize command gives you the number of observations, means, standard deviations,
minimums and maximums. To use it, click Statistics, then Summaries, tables, and tests, then Summary
and descriptive statistics and finally Summary statistics. Select or type mpg in Variables, then click
Submit.
Alternatively, you could have just typed:
sum mpg
and gotten the exact same thing (sum being an abbreviation for summarize).
Missing values are ignored when calculating summary statistics. If you type:
sum rep78
you'll see that the number of observations is 69 rather than 74 like it was for mpg. Five observations
have missing values for rep78 and could not be included in the calculations, so the mean was calculated
over the 69 observations that do have valid values.
Variable lists, if conditions, by groups and options work for summarize just like they did for list in the
examples in part one. You've already seen a variable list in action (getting summary statistics for mpg or
rep78 rather than all variables). Next, find the mean of mpg for cars weighing over 4,000 pounds by
clicking by/if/in and typing weight>4000 in the If box.
The command for this is: sum mpg if weight>4000
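As a sketch of how an if condition, a by group and an option each look with summarize, assuming the auto example data shipped with Stata:

```stata
sysuse auto, clear

* if condition: only cars weighing more than 4,000 pounds
sum mpg if weight > 4000

* by group: separate statistics for domestic and foreign cars
bysort foreign: sum mpg

* option: detail adds percentiles, skewness and kurtosis
sum mpg, detail
```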
Frequencies
The tabulate command is used to create frequency tables. It has two variants: one for one-way tables
and one for two-way tables. If you type tab, Stata will figure out which version you want by looking at
how many variables you list afterwards. But if you're using menus you'll click Statistics, Summaries,
tables, and tests, Tables and then either One-way tables or Two-way tables with measures of
association.
One-way Tables
A one-way table simply lists the values of a variable and how many times each value appears in your
data set. To create a one-way table, click Statistics, Summaries, tables, and tests, Tables and then One-
way tables. Select or type rep78 as the Categorical variable and click Submit.
The resulting command is:
tab rep78
Note that the missing values of rep78 were not included in the table. If you want to see how many
missing values you have, you should check Treat missing values like other values. Then they'll get their
own entry.
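On the command line, the checkbox for missing values corresponds to the missing option of tabulate. A minimal sketch, assuming the auto example data shipped with Stata:

```stata
sysuse auto, clear

* Missing values of rep78 are left out
tab rep78

* Missing values get their own row
tab rep78, missing
```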
Two-way Tables
Two-way tables tell you how many times each combination of two variables appears in your data. To
create a two-way table, click Statistics, Summaries, tables, and tests, Tables and then Two-way tables
with measures of association. Select or type rep78 for the Row variable and foreign for the Column
variable, then click Submit.
Note that missing values do not appear in the table unless you check Treat missing values like other
values, just like with one-way tables.
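The typed equivalent of the two-way dialog lists the row variable first and the column variable second. A minimal sketch, assuming the auto example data shipped with Stata:

```stata
sysuse auto, clear

* rep78 in rows, foreign in columns
tab rep78 foreign

* The same table with missing values shown
tab rep78 foreign, missing
```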
Correlations
To calculate the correlation between variables, click Statistics, then Summaries, tables, and tests, then
Summary and descriptive statistics and finally Correlations and covariances. Then type the names of the
variables you want the correlations for in the Variables box. This data set has several variables relating
to the size of the cars: weight, length and displacement (a measure of the size of the engine). We would
expect them to be highly correlated, but type all three in the Variables box and click Submit to verify
that hypothesis.
The command is:
correlate weight length displacement
Hypothesis Tests
To use a t-test to test the hypothesis that the mean of a variable is equal to some number, click
Statistics, then Summaries, tables, and tests, then Classical tests of hypotheses and finally One-sample
mean-comparison test. Select or type mpg as the Variable name and then type 20 in Hypothesized
mean. Click OK and Stata will test the hypothesis that the mean of mpg is 20.
The command is:
ttest mpg==20
Code: Mean, Variance, Number of Non-missing Observations, Minimum, Maximum, Etc.
summarize varlist: Shows summary information for the variables listed.
tabulate varname: Creates a table listing the number of observations having each different value of the
variable varname.
tabulate var1 var2: Creates a two-way table listing the number of observations in each row and column.
tabulate var1 var2, exact: Creates the same two-way table, and carries out a statistical test of the null
hypothesis that var1 and var2 are independent. The test is exact, in that it does not rely on convergence
to a distribution.
tabulate var1 var2, chi2: Same as above, except the statistical test is Pearson's chi-squared test, which
relies on asymptotic convergence to a chi-squared distribution. If you have lots of observations, exact
tests can take a long time and can run out of available computer memory; if so, use this test instead.
histogram varname: Plots a histogram of the specified variable.
histogram varname, bin(#) normal: The bin(#) option specifies the number of bars. The normal option
overlays a normal probability distribution with the same mean and variance.
kdensity varname, normal: Creates a “kernel density plot”, which is an estimate of the pdf that
generated the data. The “normal” option overlays a normal density for comparison.
scatter yvar xvar: Plots data, with yvar on the vertical axis and xvar on the horizontal axis.
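The commands in this list can be tried one after another on real data. The sketch below assumes the auto example data shipped with Stata; the choice of variables is arbitrary, and each graph command replaces the previous graph in the Graph window.

```stata
sysuse auto, clear

summarize price mpg weight
tabulate rep78

* Two-way table with an exact test, then with a chi-squared test
tabulate rep78 foreign, exact
tabulate rep78 foreign, chi2

* Histogram with 10 bars and a normal curve overlaid
histogram mpg, bin(10) normal

* Kernel density estimate with a normal density overlaid
kdensity mpg, normal

* mpg on the vertical axis, weight on the horizontal axis
scatter mpg weight
```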
Correlations and Covariances
The following commands compute the correlations and covariances between any list of variables. Note
that if any of the variables listed have missing values in some rows, those rows are ignored in all
calculations.
correlate var1 var2 …: Computes the sample correlations between variables.
correlate var1 var2 …, covariance: Computes the sample covariances between variables.
Sometimes you have missing values in some rows, but want to use all available data wherever possible –
i.e., for some correlations but not others.
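Stata's pwcorr command computes pairwise correlations, using for each pair every observation in which both variables are present. The comparison below is a sketch that assumes the auto example data shipped with Stata, where rep78 has missing values and price and mpg do not.

```stata
sysuse auto, clear

* correlate drops every row in which any listed variable is missing
correlate price mpg rep78

* pwcorr uses all available rows for each pair; obs shows how many
pwcorr price mpg rep78, obs
```

With pwcorr, the number of observations behind the price and mpg correlation should be larger than the number behind the correlations involving rep78.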
Generating and Changing Variables
A variable in Stata is a whole column of data. You can generate a new column of data using a formula,
and you can replace existing values with new ones. Each time you do this, the calculation is done
separately for every observation in the sample, using the same formula each time.
Generating Variables
generate newvar = …
Generates a new variable using the formula you enter in place of “…”. Examples follow.
gen f = m * a
gen xsquared = x^2
Remember, Stata allows abbreviations: “gen” means “generate”.
Replacing Values of Variables
replace agesquared = age^2
Changes the value of the variable agesquared to equal age squared. This would be useful if you had
made a mistake when you first created the variable.
replace young = age < 16 if age<.
Changes the value of the variable young to equal 1 if and only if age is less than 16, and 0 otherwise.
The “if age<.” ensures that replacements are only made when values of age are not missing.
replace young = 0 if age>=16 & age<18
Changes the value of the variable young to 0, but only if age is at least 16 and less than 18. That is, no
change is made if age is less than 16 or if age is at least 18.
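The effect of the “if age<.” guard is easiest to see on a tiny data set with one missing age. The data below are made up for illustration, and the variable name young_unguarded is hypothetical.

```stata
clear
input age
12
17
25
.
end

* Guarded: young stays missing where age is missing
generate young = age < 16 if age < .

* Unguarded: missing age counts as a very large number, so the result is 0
generate young_unguarded = age < 16

* Change values only for ages from 16 up to, but not including, 18
replace young = 0 if age >= 16 & age < 18

list
```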
If-then-else Formulas
gen val = cond(a, b, c)
Stata’s cond(if, then, else) works much like Excel’s IF(if, then, else). With the statement cond(a,b,c),
Stata checks whether a is true and then returns b if a is true or c if a is not true.
gen realwage = cond(year==1992, wage*(188.9/140.3), wage)
Creates a variable that uses one formula for observations in which the year is 1992, or a different
formula if the year is not 1992.
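A two-row example makes the behavior of cond() visible. The years and wages below are invented for illustration; the 188.9/140.3 ratio is taken from the example above.

```stata
clear
input year wage
1991 10
1992 10
end

* Rescale the wage only where year is 1992; copy it unchanged otherwise
generate realwage = cond(year == 1992, wage*(188.9/140.3), wage)

list
```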
Graphs: Histograms
To make a histogram, click Graphics, Histogram then select or type mpg as the Variable. The mpg
variable is continuous (in theory anyway) so leave Data are continuous selected. Under Y axis select
Frequency to have the bar heights labeled in terms of number of observations in each bin. Click
Submit to see the results.
The command is:
hist mpg, freq
You can choose either the number of bins or the width of each bin (one implies the other). Check the
box by Number of bins, type in 20 and click Submit again to see the difference it makes.
If you tell Stata that your variable is discrete the resulting histogram will have one bin for each unique
value of the variable. Change Variable to rep78, choose Data are discrete and click Submit. Note
that the checkboxes under Bins are grayed out.
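The dialog choices described here map onto options of the hist command. A sketch, assuming the auto example data shipped with Stata; the bin width of 2 is an arbitrary value chosen for illustration.

```stata
sysuse auto, clear

* Default bins, bar heights as frequencies
hist mpg, freq

* Choose the number of bins...
hist mpg, freq bin(20)

* ...or the width of each bin, but not both at once
hist mpg, freq width(2)

* One bar per distinct value of a discrete variable
hist rep78, discrete freq
```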