STATA Training

Training Module 1: Using Stata for Survey Data Analysis

Project: Poverty mapping and market access in Vietnam

Funding:

New Zealand Embassy with coordination by The World Bank

Implementation:

International Food Policy Research Institute (IFPRI) and the Institute for Development Studies (IDS)

Lead Trainer:

Nicholas Minot, IFPRI

Dates: 5-9 August 2002

Host institutions: Information Center for Agriculture and Rural Development

Ministry of Agriculture and Rural Development with

the Ministry of Labor, Invalids, and Social Affairs and the Ministry of Planning and Investment

Hanoi, Vietnam

Module 1: Using Stata to Analyze Survey Data IFPRI-IDS Poverty Mapping Project

N. Minot Page 1-1

Background

This is the first of three one-week training modules offered as part of the project "Poverty mapping and market access in Vietnam." The project is funded by the Embassy of New Zealand and implemented by the International Food Policy Research Institute (IFPRI) in Washington, D.C. and the Institute for Development Studies (IDS) in Sussex, England. The training modules will cover the following topics:

1. Using Stata for survey data analysis
2. Introduction to geographic information systems (GIS)
3. Poverty mapping methods: Combining census and survey data

Four characteristics of these modules need to be emphasized because they have implications for the role of the participants.

• The training modules are not lecture courses, but rather they are semi-structured hands-on workshops in which trainees will use computers to learn different methods of analyzing data. Thus, active participation of the trainees is expected and necessary to maximize the benefit from the training.

• The training modules focus on how to use computer software to implement a wide range of topics and analytical methods. In order to cover this range of methods, the course cannot provide detailed explanations of the statistical methods themselves, so it is assumed that trainees have some familiarity with concepts such as means, frequency distributions, and regression analysis.

• The training modules are cumulative in the sense that understanding the material of one day depends on having attended the training course the day before. If you cannot attend the course every day for the full day, it will be difficult to understand the new materials. For this reason, we will ask those who cannot attend regularly to withdraw to make space for other trainees.

• The training modules will be offered in English. Trainees are not expected to understand all the technical terms used in the course, but they should have a solid understanding of conversational English in order to take full advantage of the training.

At the end of each module, we will issue Certificates of Completion to each trainee who has attended all the sessions and mastered the concepts taught in the course. We reserve the right not to issue Certificates to trainees who do not attend all sessions and those who do not master the material taught.

Objectives

The objective of this training module is to improve the ability of the trainees to use Stata to generate descriptive statistics and tables from survey data, as well as carry out multiple linear regression analysis of those data. In particular, the course aims to train the participants in the following methods:

• basic file management such as opening, modifying, and saving files
• advanced file management such as merging, appending, and aggregating files
• documenting data files with variable labels and value labels
• generating new variables using various functions and operations
• creating tables to describe the distribution of continuous and discrete variables
• creating tables to describe the relationships between two or more variables
• using regression analysis to study the impact of various variables on a dependent variable
• testing hypotheses using statistical methods



Course requirements

In order to take full advantage of the materials taught in the course, trainees must have the following background:

• Conversational English that allows them to follow the instructions of the trainer
• Basic statistics such as familiarity with the concepts of means, variance, frequency distributions, and regression analysis
• Familiarity with computers, including the keyboard and mouse

Organization of the course

The training course is divided into ten sections. We will cover some material in all ten sections, but we may not be able to cover all the material, depending on the background of the trainees.

Section 1: Introduction to survey data files
Section 2: Introduction to Stata
Section 3: Exploring data files with Stata
Section 4: Saving and using Stata output
Section 5: Creating new variables
Section 6: Making tables to describe data
Section 7: Making graphs
Section 8: Modifying data files
Section 9: Introduction to programming with Stata
Section 10: Regression analysis with Stata

Each section will include some training in the use of Stata commands and a practical application of these commands to the analysis of the 1998 Vietnam Living Standards Survey (VLSS). The VLSS contains over one hundred files, but we will focus our attention on the following files:

Table 1. Sample data programs from the 1998 VLSS

Questionnaire section    Topic                             Level        File name
Extracted from various   Household characteristics         Household    hhexp98n.dta
Section 1A               List of household members         Individual   scr01a2.dta
Section 2                Education                         Individual   scr02a.dta
Section 6A               Type of housing                   Household    scr06a.dta
Section 6B p1            Housing expenses                  Household    scr06b1.dta
Section 6B p2            Housing expenses                  Household    scr06b2.dta
Section 6C               Housing characteristics           Household    scr06c.dta
Section 9B1              Rice production                   Crop         scr09b1.dta
Section 9B2              Other food crop production        Crop         scr09b2.dta
Section 9B4              Perennial cash crop production    Crop         scr09b4.dta



SECTION 1: INTRODUCTION TO SURVEY DATA FILES

1. List of useful terms

The following are some key concepts that will be used throughout this training module. Most of you will be familiar with them, but it is worth reviewing the terms for those who may not know all of them.

Records (or cases or observations) are individual observations such as individuals, farm plots, households, villages, or provinces. They are usually considered to be the "rows" of the data file. For example, data set A (below) has 5 records and data set B has 6 records. The VLSS files usually have between 6,000 and 120,000 records.

Variables are the characteristics, location, or dimensions of each record. They are considered the "columns" of the data file.

• In data set A (below), there are four variables: the household identification number, the region where the household lives, the size of the household, and the distance from the house to the nearest source of water.

• In data set B, there are six variables: the region, province, household, plot number, whether or not it is irrigated, and the size of the plot.

The VLSS files usually have between 10 and 30 variables.

The level of the dataset describes what each record represents. For example,

• In data set A (below), each record is a different household, so it is a household-level data set.

• In data set B (below), each record is a farm plot, so it is a plot-level data set. Note that more than one record has the same household identification number.

Data set A

    HHID   REG   HHSIZE   DISTWAT
    3456     1        5       1.5
    3457     1        5       0.4
    3458     1        4       0.6
    3459     2        2       5.1
    3460     3        8       1.2

Data set B

    REG   PROV   HH   PLOT   IRRIG   AREA
      1      4    1      1       1    1.5
      1      4    1      2       0    1.0
      1      5    3      1       1    0.5
      2     26    2      1       0    0.4
      2     26    2      2       1    1.0
      3     45    1      1       1    1.2

Key variables are the variables that are needed to identify a record in the data. In data set A, the variable HHID is enough to uniquely identify the record, so HHID is the only key variable. In data set B, the key variables are REG, PROV, HH, and PLOT because all four variables are needed to uniquely identify the record. The first two records have the same region, province, and household, so these three variables are not enough to uniquely identify a record.
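The idea of key variables can be checked directly in Stata. The following is a minimal sketch, not part of the original course material: it types in data set B by hand and uses the isid command, which is assumed to be available (it exists in Stata releases more recent than the Stata 7 described in this module).

```stata
* Sketch only: data set B from the text, typed in by hand
clear
input reg prov hh plot irrig area
1  4 1 1 1 1.5
1  4 1 2 0 1.0
1  5 3 1 1 0.5
2 26 2 1 0 0.4
2 26 2 2 1 1.0
3 45 1 1 1 1.2
end

* isid is silent when the listed variables uniquely identify every record
isid reg prov hh plot

* With only three variables the first two records are identical, so isid fails.
* capture stores the return code in _rc instead of stopping with an error.
capture isid reg prov hh
display _rc
```

A return code of zero means the variables form a valid key. A nonzero code means at least two records share the same values.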



Discrete variables (or categorical variables) are variables that have only a limited number of different values. Examples include region, sex, income category, type of roof, and education level. Yes/no variables such as whether a household has electricity are also discrete variables.

Binary variables (or dummy variables) are a type of discrete variable that only takes two values. They may represent yes/no, male/female, have/don't have, or other variables with only two values.

Continuous variables are variables whose values are not limited. Examples include income, farm size, number of trees, rice consumption, coffee production, and distance to the road. Unlike discrete variables, continuous variables are usually expressed in some units such as Vietnamese dong, kilometers, hectares, or kilograms and may take fractional values (4.5639).

Variable labels are longer names associated with each variable to explain them in tables and graphs. For example, the variable label for HHSIZE might be "Household size" and the label for DISTWAT could be "Distance to water (km)". Whenever possible, variable labels should include the unit (e.g. km).

Value labels are longer names attached to each value of a variable. For example, if the variable REG has eight values, each value is associated with a name. REG=1 could be "Northeast Region", REG=2 could be the "Northwest Region", and so on.

2. Structure of 1998 VLSS data files

The Vietnam Living Standards Survey was carried out in 1992-93 and 1997-98. In this section, we describe the 1997-98 VLSS, although most of the description fits the earlier survey as well since the questionnaire and data files are quite similar. The 1998 VLSS had three types of questionnaires: a household questionnaire, a community questionnaire, and a price questionnaire. Here we focus on the household questionnaire.

Household questionnaire

The household questionnaire consists of 116 files and about 60 Mb of data (in Stata format). The files cover the following topics:

Section 1: Household members
Section 2: Education
Section 3: Health
Section 4: Employment
Section 5: Migration
Section 6: Housing
Section 7: Respondents for 2nd round
Section 8: Fertility
Section 9: Agriculture, forestry, and fishery activities
Section 10: Non-farm self-employment
Section 11: Food expenditures
Section 12: Non-food expenditures and durable goods
Section 13: Income from remittances
Section 14: Borrowing, lending, and savings

Each file contains the data for one section or sub-section of the questionnaire, usually covering several pages of the questionnaire. The file names include the section number, the part letter, and sometimes a number indicating the sub-part. For example, in the file



scr09B4.dta

09 refers to Section 9, B refers to Part B, 4 refers to the 4th sub-part of Part B, and .dta is the file extension for Stata data files.

Section 9 covers agriculture, Part B covers crop production, and Part B4 covers permanent industrial crops such as tea, coffee, and rubber. Within each file, the variables are named according to the section and question number. For example, in the variable:

s9b4q031

s9b4 refers to Section 9, Part B4, q03 refers to question 3, and 1 refers to the 1st column within the question.

The variable s9b4q031 gives the area planted with a given crop, expressed in terms of hectares or number of trees. The next variable, s9b4q032, indicates whether the area is expressed in hectares or trees.



SECTION 2: INTRODUCTION TO STATA

When you open Stata, you will see a menu bar across the top, a tool bar with buttons, and 3-5 windows (the number of windows open depends on which windows were open the last time Stata was used). Each is described briefly below.

1. Menu bar

The menu bar has lists of commands that can be opened by clicking on a word. Below we provide a quick description of the different options. If you use Stata a lot, you probably will not use the menu bar often because the most common tasks can be done with the buttons on the tool bar and key-strokes.

File
    Open                  Open data file
    View                  View data file (only in Stata 7)
    Save                  Save data file
    Save as               Save data file under new name
    File name             Select data file name to put in command
    Log                   Open, close, review, or convert log file
    Save graph            Save file with graph
    Print graph           Print graph
    Print results         Print contents of current window (only in Stata 7)
    Exit                  Leave Stata

Edit
    Copy text             Copy marked text (Control-C can also be used to copy)
    Copy tables           Copy tables to insert in spreadsheet or word processor
    Paste                 Insert something previously copied (Control-V will also paste)
    Table copy options    Options for how tables are copied
    Graph copy options    Options for how graphs are copied (not in Stata 7)

Prefs
    Various options for setting preferences. For example, you can save a particular layout of the different Stata windows or change the colors used in Stata windows.

Window
    Results               Bring output window to front
    Graph                 Bring graph window to front
    Log                   Bring log window to front
    Viewer                Open help window (only in Stata 7)
    Command               Bring command window to front
    Review                Bring list of recent commands to front
    Variables             Bring list of variables to front
    Help/search           Open help window (not in Stata 7)
    Data editor           Open window to look at data
    Do-file editor        Open window to write a new program ("Do" file) or edit an existing one

Help
    Contents              Information on Stata organized by topic
    Search                Search for information on a certain topic
    Stata command         Search for information on certain Stata command
    What's new            Differences between different versions of Stata
    Other options allow you to access web sites with Stata news and information.



2. Tool bar

The buttons on the tool bar are designed to make it easier to carry out the most common tasks. The left column describes the button on the tool bar, while the right column tells what the button does.

    Open folder                  Use data file
    Diskette                     Save data file in memory to disk
    Printer                      Print contents of current window
    Scroll with traffic light    Open, close, or view log file
    Scroll without light         Bring log window to front (not in Stata 7)
    Eye                          Open window with help on using Stata (only in Stata 7)
    Box 1                        Bring Dialog Window to front
    Box 2                        Bring Results Window to front
    Box 3                        Bring Graph Window to front
    Envelope                     Open window to write a new program ("Do" file) or edit an existing one
    Table                        Open window to view and edit data
    Table and circle             Open window to view data
    Go                           Turn off "More"
    X                            Stop processing

3. Stata windows

The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Window pull-down menu or the buttons on the tool bar. These are the Stata windows:

    Stata Results           To see recent commands and output
    Stata Command           To enter a command
    Stata Browser           To view the data file (needs to be opened)
    Stata Editor            To edit the data file (needs to be opened)
    Stata Viewer            To get help on how to use Stata
    Variables               To see a list of variables
    Review                  To see recent commands
    Stata Do-file Editor    To write or edit a program (needs to be opened)

Each is described in more detail below.

Stata Results

This window (with the black background) shows all recent commands, output, error messages, and help info. In Stata 7, the text is color-coded as follows:

    white     Stata commands
    green     General information and the frame and headings of output tables
    blue      Commands or error messages that can be clicked on for more information (in Stata 7 only)
    yellow    Numbers in output tables
    red       Error messages

The slide bar on the right side can be used to look at earlier results that are not on the screen. However, unlike SPSS, the Stata Results window does not keep all output generated. It will keep about 300-600 lines of the most recent output, deleting earlier output. If you want to store output in a file, you must use the log command.



Stata Command

This window (small with a white background) allows you to enter commands, which will be executed as soon as you press the Return key. You can also use recent commands again by using the PageUp key (to go to the previous command) and PageDown key (to go to the next command).

Stata Browser

This window shows all the data in memory. The Stata Browser does not appear automatically when you start Stata. The only way to open the Browser is to click on the button with a table and magnifying glass. Unlike SPSS, when the Stata Browser is open, you cannot execute any commands, either from the Stata Command window or from the Do-file Editor. In addition, you cannot change any of the data. You can, however, sort the data or hide certain variables using buttons at the top of the Stata Browser window.

Stata Editor

This window is exactly like the Stata Browser window except that you can change the data. We do not recommend using this window because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved.

Stata Viewer

This window provides help on Stata commands and rules. To open the Stata Viewer window, you can click on Window/Viewer or click on the eye button on the tool bar. To use the Stata Viewer window, type a command in the space at the top and the Viewer will give you the purpose and rules for using that command, along with some examples. Any blue text in the Viewer can be clicked on for more information about that command.

Variables

This window (tall with a white background) lists all the variables that exist in memory. When you open a Stata data file, it lists the variables in the file. If you create new variables, they will be added to the list of variables. If you delete variables, they will be removed from the list. You can insert a variable into the Stata Command window by clicking on it in the Variables window.

Review

This window (with a white background) lists all the recent commands. If you click on one of the commands, it appears in the Stata Command window and can be executed by pressing the Return key. The slide bar can be used to view earlier commands.

Do-file Editor

This window allows you to write, edit, save, and execute a Stata program. A Stata program (or Do-file) is simply a set of Stata commands written by the user. The advantage of using the Do-file Editor rather than the Stata Command window is that the Do-file allows you to save, revise, and rerun a set of commands. Exploratory analysis of the data can be done with the Stata Command window, but any serious data analysis should be carried out using the Do-file Editor, not the Stata Command window. The Do-file Editor can be opened by clicking on Window/Do-file Editor or by clicking on the envelope button.

With so many windows, it is sometimes difficult to fit them all on the screen. You can adjust the size and position of each window the way you like it and then save the layout by clicking on Prefs/Save Windowing Preferences. Each time you open Stata, the windows will be arranged according to your preferred layout.

Table 2 (below) provides a list of Stata commands that will be introduced in Module 1:



Table 2. Stata commands and topics covered in Module 1

3. Exploring data
    clear, use, describe, list, summarize, tabulate, tab1, tab2, save, help, by prefix, if suffix, in suffix, set more, set mem, set scrollbufsize

4. Storing commands and output
    Stata Do-file editor, log, exporting tables

5. Creating new variables
    gen, replace, operators, functions, recode, tab ..., generate, xtile

6. Making tables
    labeling data, #delimit, tabulate ... summarize, tabstat, table, using weights

7. Graphs
    graph, histogram, scatterplot, bar, xlabel, ylabel, connect( ), symbol( )

8. Modifying files
    drop, drop if, keep, keep if, sort, compress, collapse, merge, append, fillin, reshape

9. Programming
    creating and using macros, creating and using loops, matrix algebra

10. Regression analysis
    regress, test, testparm, predict, probit, ovtest, hettest



SECTION 3: EXPLORING DATA FILES

This section covers commands that are used for preliminary exploration of data in a file. The following commands and topics are described:

clear use describe list summarize tabulate by prefix if suffix in suffix save help set mem set more set scrollbufsize

clear

The clear command removes all data, variables, and labels from memory to get ready to use a new data file. You can clear memory using the clear command or by using the clear option as part of the use command (see the use command). This command does not delete any data saved to the hard drive.

use

This command opens an existing Stata data file. It is equivalent to "get" in SPSS. The syntax is:

    use filename [, clear]                                        opens new file
    use [varlist] [if exp] [in range] using filename [, clear]    opens selected parts of file

• If there is no extension, Stata assumes it is .dta.
• If there is no path, Stata assumes it is in the current folder.
• You can use a path name such as: use d:\data\scr02a
• If the path name has spaces, you must use double quotes: use "d:\my data\scr02a"
• You can open selected variables of a file using a variable list.
• You can open selected records of a file using if or in.

Here are some examples of the use command:

    use hhexp98n                            opens the file hhexp98n.dta for analysis
    use if reg7 == 1 using hhexp98n         opens data from one region
    use in 5/25 using hhexp98n              opens records 5 through 25 of file
    use househol age sex using hhexp98n     opens 3 variables from hhexp98n file
    use d:\data\VLSS\scr01a2                opens the file scr01a2.dta in the specified folder
    use "d:\data files\VLSS 98\scr01a2"     use quotation marks if there are spaces
    use scr01a2, clear                      clears memory before opening the new file
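The different forms of the use command can be tried without the VLSS files. The sketch below is an illustration only: it types in data set A from Section 1, saves it under the hypothetical name dataset_a in the current folder, and then reopens parts of it. It assumes you have permission to write to the current folder.

```stata
* Sketch only: data set A from Section 1, typed in by hand
clear
input hhid reg hhsize distwat
3456 1 5 1.5
3457 1 5 0.4
3458 1 4 0.6
3459 2 2 5.1
3460 3 8 1.2
end

* dataset_a is a hypothetical file name; replace allows overwriting an old copy
save dataset_a, replace

* Open the whole file, clearing memory first
use dataset_a, clear

* Open only two variables
use hhid hhsize using dataset_a, clear

* Open only the households in region 1
use if reg == 1 using dataset_a, clear

* Open only records 2 through 4
use in 2/4 using dataset_a, clear
list
```

Each use command replaces whatever is in memory, so the final list shows only the three records selected by the last command.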



describe

This command provides a brief description of the data file. You can abbreviate it as "des" and Stata will understand. The output includes:

• the number of variables
• the number of observations (records)
• the size of the file
• the list of variables and their characteristics

It also provides the following information on each variable in the data file:

• the variable name
• the storage type: byte is used for binary variables, int is used for integers, and float is used for continuous variables that may have decimals. To see the limits on each storage type, type "help datatypes"
• the display format indicates how it will appear in the output.
• the value label is the name of a set of labels for different values
• the variable label is a name for the variable that is used in output.

Example 1 gives the description of the summary file from the VLSS called hhexp98n.

Example 1: Using "describe" to show information about a data file

    . use hhexp98n
    . describe

    Contains data from hhexp98n.dta
      obs:         5,999
     vars:            67                          6 Jan 2000 08:43
     size:     1,553,741 (98.5% of memory free)
    -------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    househol        long   %12.0g                 household code
    year            float  %9.0g                  Year of interview
    month           float  %9.0g                  Month of interview
    vlssmphs        byte   %8.0g                  1 if vlss, 2 if mphs source
    sex             byte   %8.0g                  Gender of HH.head (1:M;2:F)
    age             int    %8.0g                  Age of household head
    agegroup        byte   %8.0g       agegroup   age group of HH.head
    comped98        float  %9.0g       diploma    completed diploma HH.head
    educyr98        float  %9.0g                  schooling year of HH.head
    farm            float  %9.0g       loaiho     Type of HH (1:farm; 0:nonfarm)
    urban98         byte   %8.0g       urban      1:urban 98; 0:rural 98
    urban92         float  %9.0g       urban      1:urban92; 0:rural92
    province        float  %9.0g                  Province code
    reg7            int    %8.0g                  Code by 7 regions
    reg8            int    %8.0g                  Code by 8 regions
    reg10           int    %8.0g                  Code by 10 regions
    hhsize          long   %12.0g                 Household size
    hhcat           float  %9.0g                  hhsize categories
    wt              int    %8.0g                  sample weight
    hhsizewt        float  %9.0g                  =hhsize*wt
    vill            float  %9.0g                  village code
    [output truncated here]



list

This command lists the values of variables in the data set. It is similar to "list" in SPSS. The syntax is:

    list [varlist] [if exp] [in range]

With varlist, you can specify which variables' values will be presented. If no list is specified, all variables will be listed. With if and in, you can specify which records will be listed. Here are some examples:

    . list                           lists entire dataset
    . list in 1/10                   lists observations 1 through 10
    . list househol reg7             lists selected variables
    . list househol age in 1/20      lists observations 1-20 for selected variables
    . list if reg7 < 6               lists cases in regions 1 through 5

If you are not careful with list, you will get a lot more output than you want. If Stata starts giving you more output than you really want, use the stop button (red button with an X).

Example 2: Using "list" to look at data

    . use hhexp98n
    . list househol urban98 reg8 in 1/10

           househol   urban98      reg8
      1.        101     Urban         1
      2.        103     Urban         1
      3.        105     Urban         1
      4.        107     Urban         1
      5.        108     Urban         1
      6.        109     Urban         1
      7.        110     Urban         1
      8.        111     Urban         1
      9.        112     Urban         1
     10.        113     Urban         1

    . list househol reg8 vill if vill==32

           househol      reg8      vill
    482.       3201         7        32
    483.       3203         7        32
    484.       3205         7        32
    485.       3206         7        32
    486.       3207         7        32
    487.       3208         7        32
    488.       3215         7        32
    489.       3216         7        32
    490.       3218         7        32
    491.       3221         7        32
    492.       3222         7        32
    493.       3223         7        32
    494.       3224         7        32
    495.       3225         7        32
    496.       3226         7        32
    497.       3227         7        32



summarize

The summarize command produces statistics on continuous variables like age, income, or farm size. This is like the "describe" command in SPSS. The syntax looks like this:

    summarize [varlist] [if exp] [in range] [, detail]

By default, it produces the following statistics:

• Number of observations
• Average (or mean)
• Standard deviation
• Minimum
• Maximum

If you specify "detail", Stata gives you additional statistics such as

• skewness
• kurtosis
• the four smallest values
• the four largest values
• various percentiles

Here are some examples:

    . summarize                              gives statistics on all variables
    . summarize age income                   gives statistics on selected variables
    . summarize age income if reg8==3        gives statistics on two variables for one region

Example 3. Using "summarize" to study continuous variables

The first example gives the statistics for the whole sample, while the second gives the statistics only for households in Region 3, the Red River Delta. Notice that household heads in the Red River Delta are somewhat younger but have more education than the national average.

    . sum age educyr98 food ricexpd

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
             age |    5999    48.01284    13.7702         16         95
        educyr98 |    5999    7.094419   4.416092          0         22
            food |    5999    7272.777   4634.887   542.1666   85499.25
         ricexpd |    5999    2267.346   1140.367          0       9792

    . sum age educyr98 food ricexpd if reg8==3

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
             age |     128    42.07031   11.38184         24         79
        educyr98 |     128    8.609375   3.224499          0         17
            food |     128    5290.059   1756.087       1795   13022.08
         ricexpd |     128     2735.59   1081.493        747       7344
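To practice the summarize command and its detail option without the VLSS files, a small data set can be typed in by hand. The sketch below is an illustration only and uses data set A from Section 1 rather than any survey file.

```stata
* Sketch only: data set A from Section 1, typed in by hand
clear
input hhid reg hhsize distwat
3456 1 5 1.5
3457 1 5 0.4
3458 1 4 0.6
3459 2 2 5.1
3460 3 8 1.2
end

* Default statistics (obs, mean, std. dev., min, max) for two variables
summarize hhsize distwat

* The same statistics for households in region 1 only
summarize hhsize distwat if reg == 1

* detail adds percentiles, skewness, kurtosis, and the extreme values
summarize distwat, detail
```

With only five records the detailed statistics are not meaningful, but the layout of the output is the same as for a full survey file.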



tabulate, tab1, tab2

These are three related commands that produce frequency tables for discrete variables. They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable). These commands are similar to the "frequencies" and "crosstabs" commands in SPSS. How do they differ?

• tabulate or tab produces a frequency table for one or two variables
• tab1 produces a one-way frequency table for each variable in the variable list
• tab2 produces all possible two-variable tables from the list of variables

You can use several options with these commands:

• all gives all the tests of association for two-way tables
• cell gives the overall percentage for two-way tables
• column gives column percentages for two-way tables
• row gives row percentages for two-way tables
• nofreq suppresses printing the frequencies
• chi2 provides the chi-squared test for two-way tables

There are many other options, including other statistical tests. For more information, type "help tabulate". Some examples of the tabulate commands are:

    . tabulate reg7                     produces table of frequency by region
    . tabulate reg8 sex                 produces a cross-tab of frequencies by region and sex
    . tabulate reg8 sex, row            produces a cross-tab by region and sex with row percentages
    . tabulate reg8 sex, cell nofreq    produces a cross-tab of overall percentages by region and sex
    . tab1 reg8 sex ethnic              produces three tables, a frequency table for each variable
    . tab1 region sex ethnic            produces three tables, a frequency table for each variable
    . tab2 reg8 sex urban98             produces three tables, a cross-tab of each pair of variables
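The tabulate options listed above can be tried on a small data set typed in by hand. The sketch below is an illustration only and uses data set B from Section 1, treating region and irrigation status as the discrete variables.

```stata
* Sketch only: data set B from Section 1, typed in by hand
clear
input reg prov hh plot irrig area
1  4 1 1 1 1.5
1  4 1 2 0 1.0
1  5 3 1 1 0.5
2 26 2 1 0 0.4
2 26 2 2 1 1.0
3 45 1 1 1 1.2
end

* One-way table: number of plots in each region
tabulate reg

* Two-way table of region by irrigation, with row percentages
tabulate reg irrig, row

* Overall (cell) percentages only, without the counts
tabulate reg irrig, cell nofreq

* A separate one-way table for each variable in the list
tab1 reg irrig
```

The row option makes the percentages add to 100 across each region, while cell makes all cells in the table add to 100.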



Example 4. Using "tabulate" on categorical variables

    . tab farm

     Type of HH |
       (1:farm; |
     0:nonfarm) |      Freq.     Percent        Cum.
    ------------+-----------------------------------
       non farm |       2561       42.69       42.69
           farm |       3438       57.31      100.00
    ------------+-----------------------------------
          Total |       5999      100.00

    . tab sex farm

     Gender of | Type of HH (1:farm;
       HH.head |      0:nonfarm)
     (1:M;2:F) |  non farm       farm |     Total
    -----------+----------------------+----------
             1 |      1673       2702 |      4375
             2 |       888        736 |      1624
    -----------+----------------------+----------
         Total |      2561       3438 |      5999

    . tab sex farm, row col chi2

     Gender of | Type of HH (1:farm;
       HH.head |      0:nonfarm)
     (1:M;2:F) |  non farm       farm |     Total
    -----------+----------------------+----------
             1 |      1673       2702 |      4375
               |     38.24      61.76 |    100.00
               |     65.33      78.59 |     72.93
    -----------+----------------------+----------
             2 |       888        736 |      1624
               |     54.68      45.32 |    100.00
               |     34.67      21.41 |     27.07
    -----------+----------------------+----------
         Total |      2561       3438 |      5999
               |     42.69      57.31 |    100.00
               |    100.00     100.00 |    100.00

              Pearson chi2(1) = 130.8340   Pr = 0.000



• In one-way tables, Stata gives the count, the percentage, and the cumulative percentage (see first example in box).

• In two-way tables, Stata gives the count only, unless you ask for other statistics (see second example in box).

• col, row, and cell request Stata to include percentages in two-way tables.

by

This prefix goes before a command and asks Stata to repeat the command for each value of a variable. There is no equivalent command in SPSS. The data must be sorted by the variables in varlist before by is used. The general syntax is:

    by varlist: command

Some examples of the by prefix are:

    by sex: sum hhsize        for each sex of head of household, give stats on household size
    by reg8: tab urban98      for each region, give the frequency table of urban/rural

Example 5. Using the "by" prefix

    . sort sex
    . by sex: sum hhsize

    _______________________________________________________________________________
    -> sex = 1

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
          hhsize |    4375    5.058286   1.852724          1         16

    _______________________________________________________________________________
    -> sex = 2

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
          hhsize |    1624    3.927956   1.982762          1         19

save

This command saves the data in memory. It is equivalent to "save outfile" in SPSS. The syntax is:

    save [filename] [, replace]        saves file

• If you do not give a file name, it will use the current name.
• You cannot write over an old file unless you specify "replace" (unlike in SPSS).
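The by prefix only works on sorted data, which is why Example 5 starts with sort. The sketch below is an illustration only: it uses data set B from Section 1 typed in by hand, and it also shows bysort, a shorthand available in Stata 7 and later that sorts and applies the prefix in one step.

```stata
* Sketch only: data set B from Section 1, typed in by hand
clear
input reg prov hh plot irrig area
1  4 1 1 1 1.5
1  4 1 2 0 1.0
1  5 3 1 1 0.5
2 26 2 1 0 0.4
2 26 2 2 1 1.0
3 45 1 1 1 1.2
end

* Sort first, then repeat summarize for each region
sort reg
by reg: summarize area

* bysort does the sorting and the repetition in a single command
bysort irrig: summarize area
```

Note that bysort changes the order of the records in memory, just as a separate sort command would.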



if

We have already seen several examples of using if to select certain records in carrying out a command. This is similar to the "process if" command in SPSS, except that in Stata it is not considered a separate command. The syntax is:

    command if exp

Examples include:

    . list hhid region income if income>12000          lists data if income is above 12000
    . tab region if income>10000 & income<20000        frequency table of region if income is in range
    . summarize income if region==1 | region==2        statistics on income for regions 1 and 2

Note that tests of equality in "if" statements always use ==, not a single =. Also note that | indicates "or" while & indicates "and".

in

We have also used in to select records based on the case number. The syntax is:

    command in range

For example:

    . list in 10                 lists observation number 10
    . summarize in 10/20         summarizes observations 10-20

help

The help command gives you information about any Stata command or topic. The syntax is:

    help [command]

For example,

    . help tabulate              gives a description of the tabulate command
    . help summarize             gives a description of the summarize command

set

The set command is used to control the Stata operating environment. There are 22 set commands, but many of them are rarely used. Some of the more common ones are:

set mem XXm sets memory for Stata at XX megabytes. If you get the error message "No room to add more observations", this means the data file is too big for the memory allocated to Stata. This command increases the memory allocated to Stata. You cannot set XX greater than the RAM memory in the computer.

set more off/on is used to turn on and off the continuous scrolling of output. Use "set more off" if you are not interested in the intermediate output, only the final result. Use "set more on" if you need to be able to read the early output. Remember that the Results Window only stores the most recent 300-600 lines of output. Unlike SPSS, Stata does not automatically store all of your output.



set scrollbufsize XX     is used to change the amount of output that Stata will store. XX is expressed in bytes. The default is 32,000 (32k) and the maximum is 500,000 (500k).

Type "help set" for a list of other settings in Stata.

Exercises for exploring the VLSS

This section includes some questions that you can answer using the VLSS files provided on your computer and the commands described in this section. (Note: To get the correct answers, we should use the sample weights, which are described in Section 6. The weights compensate for the fact that some types of households are over-represented in the VLSS sample and others are under-represented. For example, urban households make up 29 percent of the sample, but only 24 percent of the population.)

Remember two tricks to make it easier to fix your mistakes:

• You can use PageUp to retrieve the most recent command.
• You can click on variables in the Variable window to paste them into the Command window.

Summary file

The file hhexp98n contains summary variables calculated from various other data files. It is at the household level. Open the file by entering "use hhexp98n" in the Command window and pressing Return.

1. How many variables and how many records are in hhexp98n? (Answer: describe)

2. What percentage of households have female heads? (Answer: tab sex)

3. Is there a statistically significant difference between the percentage of female-headed households in urban and rural areas? (use the chi2 option)

4. What percentage of urban households are considered farm households? (use the "if urban98==1" option)

5. What percentage of farm households are in urban areas?

6. How does the percentage of female-headed households vary by region?

7. What is the average size of a household?

8. What is the average size of an urban household in the Red River Delta? (reg8=1 refers to RRD)

9. How does household size vary across expenditure quintiles? (use quint98b for quintiles; you will need to sort and then use by)

Household members

The file scr01a2 contains information about each member of the household. It is at the individual level (each record is a person). You can answer the following questions using this file:

1. What percentage of the population is female? (Answer: tab s1aq02)

2. What percentage of the population over 80 years old is female? (use "tab ... if ...")

3. What percentage of the population under 5 is female?

4. What percentage of women are married?

5. What percentage of the women over the age of 20 are married?

6. Does this percentage vary between urban and rural areas?

7. What percentage of the spouses of family members live in the household?

8. Is the percentage of spouses away greater for men or for women?

Housing characteristics

The file scr06b1 contains information about the characteristics of houses. Open the file and use "des" to obtain a list of variables. The following questions can be answered from this file.

1) What is the average value of the house, according to the respondent? (Answer: sum s6bq12)

2) What are the most important sources of water? (use "tab")

3) Of those households that think their water is safe before boiling, what percentage boil their water before drinking?

4) What is the average value of the house among those who get their water from an inside private tap?

5) What is the average value of the house among those who get their water from a hand-dug well?

6) What is the average value of the house for each type of source of drinking water? (you will need to sort by drinking water type and then use the "by" option)

Food crops

The file scr09b2 contains information on production of food crops other than rice. The data are at the crop level, meaning that each record represents one crop for one household. Only crops that are grown by each household are included in the file. The crop codes are in the questionnaire on pages before and after the questions. You can answer the following questions with this file.

1. How many households in the sample grow maize? (Answer: tab s9b2cc)

2. Among maize growers, what was the average area with maize? (Answer: sum s9b2q03 if s9b2cc==8)

3. Among maize growers, what was the average amount of maize harvested, sold, and given to livestock?

4. Among farmers with more than 1 hectare of maize, what was the average amount of maize harvested, sold, and given to livestock? (you will need an "if" statement that selects both for maize and for area greater than 10,000 m2)

5. What is the average amount harvested and sold for each food crop other than rice? (you will need to sort and use "by s9b2cc")

6. Farmers were asked what percentage of the normal harvest they got this year, so 100% means normal. What was the average response?

7. How large are the post-harvest losses in maize relative to the size of the harvest? In tomatoes?



SECTION 4: STORING COMMANDS AND OUTPUT

In this section, we discuss how to store commands and output for later use. First, we describe how to store commands in a program (Stata calls it a Do-file), how to edit the program, and how to run it. Second, we present different ways of saving and using the output generated by Stata. The following topics are covered:

using the Do-file Editor
log using
log off
log on
log close
set logtype
moving tables from Stata to Word and Excel

Using the Do-file Editor

As mentioned in Section 2, the Do-file Editor allows you to store a program (a set of commands) so that you can edit it and execute it later. Why use the Do-file Editor?

• It makes it easier to check and fix errors,
• It allows you to run the commands later,
• It lets you show others how you got your result, and
• It allows you to collaborate with others on the analysis.
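To make this concrete, here is a minimal sketch of what a short Do-file might contain. The data are made up and typed in with "input" so that the file can be run on its own; the variable names (hhid, urban, hhsize) are only illustrative and are not taken from the VLSS files. Lines beginning with * are comments that Stata ignores.

```stata
* A short Do-file with made-up data (hypothetical example)
clear
input hhid urban hhsize
1 0 6
2 0 4
3 1 3
4 1 5
5 0 7
end

* frequency table of rural (0) and urban (1) households
tabulate urban

* average household size in rural and urban areas
sort urban
by urban: summarize hhsize
```

Because the commands are stored in a file, the whole analysis can be repeated or corrected later without retyping anything.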

In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a Do-file to store the commands. To open the Do-file Editor, you can click on Windows/Do-file Editor or click on the envelope on the Tool Bar. Within the Do-file Editor, there is a menu bar and tool bar buttons to carry out a variety of editing functions. The menu bar is similar to the one in Microsoft Word:

File/New             to open a new, blank Do-file
File/Open            to open an existing Do-file
File/Save            to save the current Do-file
File/Save as         to save the current Do-file under a new name
File/Insert file     to insert another file into the current one
File/Print           to print the Do-file
File/Close           to close the Do-file
Edit/Undo            to undo the last command
Edit/Cut             to delete or move the marked text in the Do-file
Edit/Copy            to copy the marked text in the Do-file
Edit/Paste           to insert the copied or cut text into the Do-file
Search/Find          to find a word or phrase in the Do-file
Search/Replace       to find and replace a word or phrase in the Do-file
Tools/Do             to execute all the commands or the marked commands in the Do-file
Tools/Run            to execute all the commands or the marked commands in the Do-file without showing any output in the Stata Results window

The tool bar buttons can be used to carry out some of these tasks more quickly. For example, there are buttons for File/New, File/Open, File/Print, Search/Find, Edit/Cut, Edit/Copy, Edit/Paste, Edit/Undo, Do, and Run. Probably the button you will use most is the second-to-last one, which shows a page with text on it. This is the "Do" button for executing the program or the marked part of the program.

Finally, the keyboard commands may be even quicker to use than the buttons. The most useful keyboard commands are:

Control-O     Open file
Control-S     Save file
Control-C     Copy
Control-X     Cut
Control-V     Paste
Control-Z     Undo
Control-F     Find
Control-H     Find and Replace

To run the commands in a Do-file, you can click on the Do button (the second-to-last one) or click on Tools/Do. If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button. You do not have to mark the whole command, but at least one character in the command must be marked in order for the command to be executed (unlike SPSS, it is not enough to have the cursor on a command).

Although layout is a matter of personal preference, it may be useful to have the Stata Results window and the other windows on one side of the screen and the Do-file Editor window on the other. This makes it easy to switch back and forth. When you have arranged the windows the way you like, you can save the layout by clicking Prefs/Save Windowing Preferences. Each time you open Stata, it will use your chosen layout.

Saving the output

As mentioned in Section 2, the Stata Results window does not keep all the output you generate. It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results. You can increase the amount of memory allocated to the Stata Results window (see "set scrollbufsize" in Section 3), but even this will probably not be enough for a long session with Stata. Thus, we need to use log to save the output. There are four ways to control the log operations.

1. You can use the log button on the tool bar. It looks like a scroll.
2. You can click on File/Log to get four options: Begin (log using), Close, Suspend (log off), and Resume (log on).
3. You can use "log" commands in the Stata Command window.
4. You can use "log" commands in the Stata Do-file Editor.

In this section, we describe the commands, which can be used in the Stata Command window or in a Do-file (program).

log using

This command creates a file with a copy of all the commands and output from Stata. The first time you open a log, you must give a name to the new file to be created. The syntax is:

log using filename [, append replace [ text | smcl ] ]

where filename is the name you give the new file. The options are:



append      adds the output to an existing file
replace     replaces an existing file with the output
text        tells Stata to create the log file in text (ASCII) format
smcl        tells Stata to create the log file in SMCL format

Here are some examples:

log using temp22                          saves output to a file called temp22
log using temp20, replace                 saves output to an existing file, temp20, replacing its contents
log using regoutput, append               saves output to an existing file, regoutput, adding to its contents
log using "d:\my data\myfile.txt"         saves output in the specified file in the specified folder

Several points should be remembered in using this command:

• If you use an existing file name but do not say "replace" or "append", Stata will give an error message that the file already exists.
• Log files in text format can be opened with WordPad, Notepad, the DOS editor, or any word processor, but the file does not have any formatting.
• SMCL files have formatting (bold, colors, etc.) but can only be opened with Stata.
• SMCL format is the default.

log off

This command temporarily turns off the logging of output, so that any subsequent output is not copied to the log file. This is useful if you want to save some of the output but not all. "Log off" only works after a "log using" command.

log on

This command is used to restart the logging, copying any new output to the log file that was already defined. "Log on" only works after a "log using" and a "log off" command.

log close

This command is used to turn off the logging and save the file. How are "log off" and "log close" different? "Log off" allows you to turn logging back on easily with "log on," continuing to use the same log file. After a "log close", however, the only way to start logging again is with "log using."

set logtype text

This command tells Stata to always save the log files in text (ASCII) format. It is the same as adding the "text" option to every "log using" command, but it is easier. If you prefer text format log files (as I do), this is the best way to make sure all the log files are in this format.

set logtype smcl

This command tells Stata to always save log files in SMCL format. It is the same as adding the "smcl" option to every "log using" command.
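In a Do-file, the log commands are usually placed around the analysis commands. The following is a minimal sketch with made-up data; the file name mylog.log is hypothetical, and "capture" (a Stata prefix not covered in this module) is used here on the assumption that you want the Do-file to keep running even if no log is open when "log close" is first called.

```stata
* close any log left open from an earlier run; capture hides the error if none is open
capture log close

* open a text-format log, replacing any earlier file with the same name
log using mylog.log, replace text

clear
input hhid hhsize
1 4
2 6
3 5
end

* this output is copied to the log
summarize hhsize

* this output is not copied to the log
log off
list

* logging resumes, then the log is closed and saved
log on
tabulate hhsize
log close
```

Using "replace" means the Do-file can be run repeatedly without the "file already exists" error.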

Example 6 shows how the log command can be used. First, the log is opened using the filename "temp1." Since I did not specify a folder, it saved the file to the default folder which (in this case) was my desktop. The results from "tab urban98" are saved in the log file. Then the log is turned off, so the results of "sum hhsize" are not logged. Third, the log is turned on so the results from "sum age" are logged. Finally, the log is closed.



Example 6: Using "log" to save output

. log using temp1, text
----------------------------------------------------------------------------
       log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
  log type:  text
 opened on:   2 Aug 2002, 12:58:52

. tab urban98

 1:urban 98; |
  0:rural 98 |      Freq.     Percent        Cum.
------------+-----------------------------------
      Rural |       4269       71.16       71.16
      Urban |       1730       28.84      100.00
------------+-----------------------------------
      Total |       5999      100.00

. log off
       log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
  log type:  text
 paused on:   2 Aug 2002, 12:59:26
-----------------------------------------------------------------------------

. sum hhsize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize |    5999    4.752292   1.954292          1         19

. log on
-----------------------------------------------------------------------------
       log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
  log type:  text
resumed on:   2 Aug 2002, 12:59:48

. sum age

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         age |    5999    48.01284    13.7702         16         95

. log close
       log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
  log type:  text
 closed on:   2 Aug 2002, 13:00:00
---------------------------------------------------------------------------

Using the output

The easiest way to look at a log file is with File/Log/View, but there are several other ways to do it. You can:

• type "view [filename]" in the Stata Command window
• click on the Viewer button (it looks like an eye) and type "view [filename]"
• if it is in text format, open the Stata Do-file Editor (Windows/Do-file Editor) and open the log file with the Editor (File/Open)
• if it is in text format, open WordPad (Start/Programs/Accessories/WordPad) and then open the log file in WordPad (File/Open)

To print output from the Stata Results window, you can click File/Print Results.



To print output from a log file,

1) Open the log file with Stata Viewer (File/Log/View)
2) Click on File/Print Viewer

Unfortunately, it is not easy to copy Stata output to other software such as word processors and spreadsheets. It is best to copy tables from the Stata Viewer or from the Stata Results window using Edit/Copy Table.

To move tables from a log file to an Excel table,

1) Open the log file with Stata Viewer (File/Log/View)
2) Copy the table with Edit/Copy Table or Control-Shift-C
3) Paste the table into Excel

To move tables from a log file to a Word table,

1) Open the log file with Stata Viewer (File/Log/View)
2) Copy the table with Edit/Copy Table or Control-Shift-C
3) Paste the table into Word with Control-V
4) Mark the table and then click Table/Insert/Table

To move tables from the Stata Results window to Word or Excel, follow the above procedures starting with step 2. However, one problem with these procedures is that there has to be a clear division between columns. If there is a heading that overlaps two columns, the two columns will be merged. To avoid this, you can exclude the heading when you copy the table.

Exercises for logging

1) Use the file hhexp98n and open a log file called "results" to save output. Then do a frequency table of region by urban. Close the log file.
2) Copy the table into Excel.
3) Copy the table into a Word table.



SECTION 5: CREATING NEW VARIABLES

In the previous sections, we described how to explore the data using existing variables. In this section, we discuss how to create new variables. When new variables are created, they are in memory and they will appear in the Data Browser, but they will not be saved on the hard disk unless you use the save command. In this section, we will cover the following commands and options.

generate
replace
tab ..., generate
egen
operators
functions
recode
xtile

generate

This command is used to create a new variable. It is similar to "compute" in SPSS. The syntax is:

generate newvar = exp [if exp]

where "exp" is an expression like "price*quant" or "1000*kg". Several points about this command:

• Unlike "compute" in SPSS, generate cannot be used to change the definition of an existing variable. If you want to change an existing variable, you need to use "replace."
• You can use "gen" as an abbreviation for "generate."
• If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true.
• If you use "if", the new variable will have missing values when the "if" statement is false.

For example,

generate age2 = age*age                 creates an age squared variable
gen yield = quant/area if area>0        creates a new yield variable if area is positive
gen price = value/quant if quant>0      creates a new price variable if quant is positive
gen highprice = (price>1000)            creates a dummy variable equal to 1 for high prices

replace

This command is used to change the definition of an existing variable. The syntax is the same:

replace oldvar = exp [if exp] [in exp]

Some points to remember:

• Replace cannot be used to create a new variable. Stata will give an error message if the variable does not exist.
• There is no abbreviation for "replace." Stata wants to make sure you really want to change the variable.
• If you use the "if" option, then the old values will be retained when the "if" statement is false.
• You can use the period (.) to represent missing values.

For example,

replace price = avgprice if price > 100000     replaces high values with an average price
replace income = . if income<=0                replaces zero or negative income with a missing value
replace age = 25 in 1007                       replaces age with 25 in observation #1007
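The commands above can be tried on a small made-up dataset. The sketch below uses hypothetical values for value and quant. It also illustrates one point that is easy to miss: Stata treats a missing value as larger than any number, so an expression such as (price>1000) is true when price is missing, and the dummy may need to be reset to missing with replace.

```stata
clear
input hhid value quant
1 5000 10
2 1200 0
3 9000 3
4 400 8
end

* price is defined only when quant is positive; household 2 gets a missing value
generate price = value/quant if quant>0

* dummy variable: 1 if price is above 1000, 0 otherwise
generate highprice = (price>1000)

* a missing price counts as "above 1000", so reset the dummy to missing in that case
replace highprice = . if price==.

list
```

After these commands, highprice is 0 for households 1 and 4, 1 for household 3, and missing for household 2.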

tabulate ..., generate

This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable. The syntax is:

tabulate oldvariable, generate(newvariable)

The old variable is a categorical (or discrete) variable. The new variables will take the form newvariable1, newvariable2, newvariable3, etc. Newvariablex will be equal to 1 if oldvariable takes its x-th distinct value (counting from the smallest) and 0 otherwise. When the old variable is coded 1, 2, 3, and so on, this means that newvariablex is equal to 1 if oldvariable=x.

It is easier to explain with an example. Reg8 is a variable that takes values of 1-8 for the different regions of Vietnam. We can create eight dummy variables as follows:

tab reg8, gen(region)

This creates 8 new variables:

region1 = 1 if reg8=1 and 0 otherwise
region2 = 1 if reg8=2 and 0 otherwise
...
region8 = 1 if reg8=8 and 0 otherwise

In Example 7, notice that there are 1175 households in region 1 (Red River Delta) and the same number of households for which region1=1.

Example 7. Using "tab ..., gen" to create dummy variables

. tab reg8, gen(region)

  Code by 8 |
    regions |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1175       19.59       19.59
          2 |        731       12.19       31.77
          3 |        128        2.13       33.91
          4 |        708       11.80       45.71
          5 |        628       10.47       56.18
          6 |        276        4.60       60.78
          7 |       1241       20.69       81.46
          8 |       1112       18.54      100.00
------------+-----------------------------------
      Total |       5999      100.00

. tab region1

     reg8== |
     1.0000 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |       4824       80.41       80.41
          1 |       1175       19.59      100.00
------------+-----------------------------------
      Total |       5999      100.00
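When the categorical variable is not coded 1, 2, 3, ..., the numbering of the dummy variables follows the order of the values rather than the values themselves. The sketch below shows this with a made-up variable called zone (a hypothetical name) that takes the values 10, 20, and 30.

```stata
clear
input hhid zone
1 10
2 20
3 20
4 30
end

* creates z1, z2, and z3: one dummy for each distinct value of zone, in sorted order
tabulate zone, generate(z)

* z2 is the dummy for zone==20, the second distinct value
list hhid zone z1 z2 z3
```

It is a good habit to list or tabulate the new dummies against the original variable, as in Example 7, to confirm which value each one represents.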



egen

This is an extended version of "generate" that creates a new variable by aggregating the existing data. It is a powerful and useful command that does not exist in SPSS. To do the same thing in SPSS, you would need to create a new file with "aggregate" and merge it with the original file using "match files." The syntax is:

egen newvar = fcn(arguments) [if exp] [in range] , by(var)

where

newvar     is the new variable to be created
fcn        is one of numerous functions such as:

           count( )   max( )   min( )   mean( )
           median( )  rank( )  sd( )    sum( )

argument   is normally just a variable
var        in the by() option must be a categorical variable

Suppose you want to estimate the demand for rice using household data. You calculate a price variable using household expenditure data, but some households do not buy rice, so their price is missing. The first step in filling in these missing values is to calculate the provincial average price:

egen avgprice = mean(price), by(province)

The replace command can then be used to substitute avgprice for the missing values of price.

Here are some other examples:

egen avg = mean(yield)                    creates variable of average yield over entire sample
egen avg2 = median(income), by(sex)       creates variable of median income for each sex
egen regprod = sum(prod), by(region)      creates variable of total production for each region
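The two steps for filling in missing rice prices (egen, then replace) can be sketched on a small made-up dataset. The numbers below are hypothetical; the point is only to show that mean() ignores the missing values when it calculates the provincial average.

```stata
clear
input hhid province price
1 1 2000
2 1 .
3 1 2400
4 2 3000
5 2 .
end

* provincial average price; households with a missing price are ignored
egen avgprice = mean(price), by(province)

* fill in the missing prices with the provincial average
replace price = avgprice if price==.

list
```

In this sketch, household 2 receives the province 1 average (2200) and household 5 receives the province 2 average (3000).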

Example 8: Using egen to calculate averages

. egen avgexp = mean(rlpcex2), by(vill)
. gen aboveavg = (rlpcex2>avgexp)
. list househol vill rlpcex2 avgexp aboveavg in 40/50

      househol   vill    rlpcex2     avgexp   aboveavg
 40.       305      3   7858.862   6643.441          1
 41.       315      3   13006.72   6643.441          1
 42.       306      3   3787.546   6643.441          0
 43.       301      3    12084.1   6643.441          1
 44.       310      3   4785.421   6643.441          0
 45.       311      3   6666.962   6643.441          1
 46.       405      4   6583.107   8231.103          0
 47.       409      4   14452.78   8231.103          1
 48.       407      4    3549.75   8231.103          0
 49.       403      4   4145.278   8231.103          0
 50.       401      4   6454.877   8231.103          0



In Example 8, we want to know which households have per capita expenditure (rlpcex2) above the village average. First, we calculate the average expenditure for each village with the �egen� command. Then we create a dummy variable based on the expression (rlpcex2 > avgexp). The list output shows how the village average is repeated for every household in the village and confirms that the dummy variable is correctly calculated.

operators

This is not a Stata command, but a topic related to creating new variables. Most of the operators are obvious, but some are not. Unlike SPSS, you cannot use words like "or", "and", "eq", or "gt".

Arithmetic
+     addition
-     subtraction
*     multiplication
/     division
^     power

Relational
>     greater than
<     less than
>=    greater than or equal
<=    less than or equal
==    equal
~=    not equal
!=    not equal

Logical
~     not
|     or
&     and

The most difficult rule to remember is when to use = and when to use ==.

• Use a single equal symbol (=) when defining a variable.
• Use a double equal symbol (==) when you are testing an equality, such as in an "if" statement and when creating a dummy variable.

Here are some examples to illustrate the use of these operators. Suppose you want to create a dummy variable indicating households in the Red River Delta. One way is to write:

generate RRD = 0
replace RRD = 1 if reg8==1

Or you can get exactly the same result with just one command:

generate RRD = (reg8==1)

If the expression in parentheses is true, the value is set to 1. If it is false, the value is 0.

Logical operators are useful if you want to impose more than one condition. For example, suppose you want to create a dummy variable for farmers in the Red River Delta. In other words, a household must be both in the Red River Delta and be a farmer to be selected.

gen RRDfarm = 0
replace RRDfarm = 1 if reg8==1 & farm==1

or an easier way to do this would be:

gen RRDfarm = (reg8==1 & farm==1)

Or suppose you wanted to create a dummy variable for households in the two deltas. This means a household can be in the Red River Delta or it can be in the Mekong River Delta to be selected. This variable can be created with:

gen delta = 0
replace delta = 1 if reg8==1 | reg8==8

or by one command:

gen delta = (reg8==1 | reg8==8)

You can also combine conditions using parentheses. Suppose you wanted a dummy variable that indicates if a household is a poor farmer in one of the deltas. We will define poor as in the bottom 20 percent and use the variable quint98.

gen PDF = ((reg8==1 | reg8==8) & farm==1 & quint98==1)

functions

Again, this is not a command, but a topic that is related to creating new variables. Here is a list of some of the more commonly-used functions. Other functions can be found by typing "help functions" in the Stata Command window.

abs(x)         computes the absolute value of x.
exp(x)         calculates e to the x power.
ln(x)          computes the natural logarithm of x.
log(x)         is a synonym for ln(x), the natural logarithm.
log10(x)       computes the log base 10 of x.
sqrt(x)        computes the square root of x.
invnorm(p)     provides the inverse cumulative normal; invnorm(norm(z)) = z.
normden(z)     provides the standard normal density.
normden(z,s)   provides the normal density. normden(z,s) = normden(z)/s if s>0 and s not missing; otherwise, the result is missing.
norm(z)        provides the cumulative standard normal.
group(x)       creates a categorical variable that divides the data into x as nearly equal-sized subsamples as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data.
int(x)         gives the integer obtained by truncating x.
round(x,y)     gives x rounded into units of y.
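The operators and a few of the functions listed above can be tried together on a small made-up dataset. In the sketch below, the values of reg8, farm, and income are hypothetical and chosen only to show what each expression returns.

```stata
clear
input hhid reg8 farm income
1 1 1 2600
2 8 0 400
3 4 1 1600
4 8 1 -90
end

* | means "or", & means "and", and == tests equality
generate deltafarm = ((reg8==1 | reg8==8) & farm==1)

* natural logarithm: missing for household 4 because income is negative
generate lninc = ln(income)

* absolute value and square root
generate absinc = abs(income)
generate root = sqrt(absinc)

* round to the nearest 1000, and truncate income expressed in thousands
generate inc1000 = round(income, 1000)
generate incint = int(income/1000)

list
```

Households 1 and 4 are farm households in one of the two deltas, so deltafarm is 1 for them and 0 for the others. For household 3, inc1000 is 2000 (rounded) while incint is 1 (truncated), which shows the difference between round() and int().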



recode

This command changes the values of a categorical variable according to the rules specified. It is like the "recode" command in SPSS except that in Stata you do not use parentheses. The syntax is:

recode varname old=new old=new ... [if exp] [in range]

Here are some examples:

recode x 1=2             changes all values of x=1 to x=2
recode x 1=2 3=4         changes 1 to 2 and 3 to 4
recode x 1=2 2=1         exchanges the values 1 and 2 in x
recode x 1=2 *=3         changes 1 in x to 2 and all other values to 3
recode x 1/5=2           changes 1 through 5 in x to 2
recode x 1 3 4 5 = 6     changes 1, 3, 4 and 5 to 6
recode x .=9             changes missing to 9
recode x 9=.             changes 9 to missing

Notice that you can use some special symbols in the rules:

*      means all other values
.      means missing values
x/y    means all values from x to y
x y    means x and y
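Because recode overwrites the variable, it can be useful to work on a copy and then compare the copy with the original. The following is a minimal sketch with made-up values; the variable names x and xnew are hypothetical.

```stata
clear
input id x
1 1
2 2
3 3
4 5
5 9
end

* work on a copy so that the original values are kept
generate xnew = x

* 1 and 2 become 1, 3 through 5 become 2, and 9 becomes missing
recode xnew 1/2=1 3/5=2 9=.

* cross-tabulate old and new values to check the rules
tabulate x xnew, missing
```

The "missing" option of tabulate is included so that the recoded missing value also appears in the table.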

In Example 9, we create a new variable that indicates whether a household lives in the north, center, or south of Vietnam, using the reg8 variable.

Example 9. Using recode to define a new variable

. tab reg8

  Code by 8 |
    regions |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1175       19.59       19.59
          2 |        731       12.19       31.77
          3 |        128        2.13       33.91
          4 |        708       11.80       45.71
          5 |        628       10.47       56.18
          6 |        276        4.60       60.78
          7 |       1241       20.69       81.46
          8 |       1112       18.54      100.00
------------+-----------------------------------
      Total |       5999      100.00

. gen reg3 = reg8
. recode reg3 1/3 =1 4/6=2 7/8=3
(4824 changes made)

. tab reg3

       reg3 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       2034       33.91       33.91
          2 |       1612       26.87       60.78
          3 |       2353       39.22      100.00
------------+-----------------------------------
      Total |       5999      100.00



xtile

This command creates a new variable that indicates which category a record falls into, when the sample is sorted by an existing variable and divided into n groups of equal size. It is probably easier to explain with examples. xtile can be used to create a variable that indicates which income quintile a household belongs to, which decile in terms of farm size, or which tercile in terms of coffee production. The syntax is:

xtile newvar = variable [if exp] [in range] , nq(#)

where

newvar     is the new categorical variable created
variable   is the existing variable used to create the quantile (e.g. income, farm size)
#          is the number of different categories (e.g. 5 for quintiles, 3 for terciles)

For example,

xtile incquint = income, nq(5)
xtile farmdec = farmsize, nq(10)
xtile coffeeter = coffarea, nq(3)

Suppose we want to create a variable indicating the tercile of rice expenditure per capita.

Example 10. Using xtile to create categories

. gen ricepc = ricexpd/hhsize
. xtile riceterc = ricepc, nq(3)
. tab riceterc

3 quantiles |
  of ricepc |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       2000       33.34       33.34
          2 |       2003       33.39       66.73
          3 |       1996       33.27      100.00
------------+-----------------------------------
      Total |       5999      100.00

. tab riceterc farm, col nof

         3 |  Type of HH (1:farm;
 quantiles |      0:nonfarm)
 of ricepc |  non farm       farm |     Total
-----------+----------------------+----------
         1 |     46.70      23.39 |     33.34
         2 |     31.47      34.82 |     33.39
         3 |     21.83      41.80 |     33.27
-----------+----------------------+----------
     Total |    100.00     100.00 |    100.00
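To see exactly how xtile assigns categories, it can help to try it on a very small made-up dataset where the answer is easy to check by eye. The sketch below uses eight hypothetical households with incomes of 10 to 80 and divides them into four groups.

```stata
clear
input hhid income
1 10
2 20
3 30
4 40
5 50
6 60
7 70
8 80
end

* divide the households into 4 groups of equal size according to income
xtile incquart = income, nq(4)

tabulate incquart
list
```

With eight distinct incomes and four groups, each group should contain two households, with the two lowest incomes in group 1 and the two highest in group 4.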



Exercises for generating new variables

1) Use the file hhexp98n. Create a variable called "reg2" which indicates whether a household is in the north or the south of Vietnam based on reg8. Then do a frequency table of the new variable.

2) Using the same file, create a variable called "hhquint" that indicates the quintile of household size. Then do a frequency table on the new variable.

3) Using the same file, create a dummy variable called "rurfarm" that is equal to 1 if the household is a rural farm household and 0 otherwise. Create another variable called "upland" that is 1 if the household is in the Northwest, Northeast, or Central Highlands.

4) Create a new variable "avgexp" which is equal to the regional average of expenditure (rlpcex2) (hint: use egen). Then calculate a new variable equal to the difference between the household expenditure and the regional average expenditure.

5) Use the file scr01a2. Create a variable "hhsize" which is equal to the total number of household members. (use egen)

6) Using the same file, create a new variable "notmarry" which is 1 if the person is single, divorced, or separated and 0 otherwise.

7) Create a set of dummy variables called "relatxx" based on the relationship of the person to the household head. For example, relat01 is a dummy for being the head, relat02 is a dummy for being the spouse, relat03 for a child, and so on. (hint: use tab ..., gen)



SECTION 6: MAKING TABLES TO DESCRIBE DATA

In Section 3, we described some basic commands for exploring data. In this section, we introduce some more powerful and flexible commands for generating results from survey data. We begin with an explanation of how to label data in Stata. Then we describe three commands for generating tables. Finally, we will describe the use of sampling weights in analyzing survey data. These are the topics and commands covered in this section:

label variable
label define
label values
#delimit
tabulate ... summarize
tabstat
table
using weights

label variable

This command is used to attach labels to variables in order to make the output easier to understand. For example, we know that reg8 indicates the number of the region where a household lives and that rlpcex2 means real per capita expenditure. But other people using our tables may not know this. So we may want to label the variables as follows:

label variable reg8 Region
label variable rlpcex2 "Per capita expenditure"

• You can use the abbreviation "label var".
• If there are spaces in the label, you must use double quotation marks.
• If there are no spaces, quotation marks are optional.
• This command is like "variable label" in SPSS except that you can only label one variable per command and Stata uses double quotation marks, not single.
• The limit is 80 characters for a label, but any labels over 30 characters will probably not look good in a table.

label define

This command gives a name to a set of value labels. For example, instead of numbering the regions, we can assign a label to each region. Instead of numbering the different sources of water, we can give them labels. The syntax is:

label define lblname # "label" # "label" # "label" [, add modify]

where

lblname    is the name given to the set of value labels
#          are the value numbers
"label"    are the value labels
add        means that you want to add these value labels to the existing set
modify     means that you want to change these values in the existing set

Note that:

• You can use the abbreviation "label def".
• The double quotation marks are only necessary if there are spaces in the labels.
• Stata will not let you define an existing label unless you say "modify" or "add".
• This command is similar to "value label" in SPSS except that in Stata you give the labels a name and later attach it to the variable, while in SPSS you attach it to the variable in the same command.

label values

This command attaches a named set of value labels to a categorical variable. The syntax is:

label values varname lblname

where
varname   is the categorical variable which will get the labels
lblname   is a set of labels that have already been defined by label define

Here are some examples of labeling values in Stata.

. label variable yield "Yield (tons/hectare)"                     gives label to variable yield
. label define yesno 0 no 1 yes                                   defines set of labels called yesno
. label values electricity yesno                                  attaches those labels to variable called electricity
. label define yesno 3 "perhaps", add                             adds new value label to existing set
. label define yesno 3 "maybe", modify                            modifies existing value label
. label define reglbl 1 RRD 2 NW 3 NE 4 NCC 5 SCC 6 CH 7 NES 8 MRD
. label values reg8 reglbl
. label define reglbl 7 "Southeast" 8 "Mekong Delta", modify

Some additional commands that may be useful in labeling:

label dir          to request a list of existing label names
label list         to request a list of all the existing value labels
label drop         to delete one or more labels
label save using   to save label definitions as a Do-file
label data         to give a label to a data file

More information is available by typing "help label" in the Stata Command window.

Example 11 shows a frequency table with and without labels. The first table has no labels. Then a label var command is used to define the label "Region", a label define command creates a set of labels, and label values attaches those labels to the reg8 variable. The second table has both the variable label (in the upper left corner of the table) and the labels for the regions. Finally, we show how label list can be used to display the labels assigned to a label name.
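The three labeling commands can also be tried on a small made-up data set before working with the survey files. The sketch below is illustrative only: the variables hhid and electricity and their values are assumptions invented for the example, not part of the VLSS.

```stata
* Illustrative sketch: hhid and electricity are made-up variables
clear
input hhid electricity
1 0
2 1
3 1
end

* Attach a variable label, define a set of value labels, and attach the set
label variable electricity "Household has electricity"
label define yesno 0 "no" 1 "yes"
label values electricity yesno

* The frequency table now shows no/yes instead of 0/1
tabulate electricity

* Display the value labels stored under the name yesno
label list yesno
```

Because the set of labels has its own name (yesno), the same set can be attached to any other 0/1 variable with another label values command.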



Example 11. Using label to make tables more readable

. tab reg8

       reg8 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1175       19.59       19.59
          2 |        731       12.19       31.77
          3 |        128        2.13       33.91
          4 |        708       11.80       45.71
          5 |        628       10.47       56.18
          6 |        276        4.60       60.78
          7 |       1241       20.69       81.46
          8 |       1112       18.54      100.00
------------+-----------------------------------
      Total |       5999      100.00

. label var reg8 Region
. label define reglbl 1 "Red River Delta" 2 "Northwest" 3 "Northeast" 4 "N.C
> Coast" 5 "S.C. Coast" 6 "Central Highlands" 7 "Southeast" 8 "Mekong Delta"
. label values reg8 reglbl
. tab reg8

           Region |      Freq.     Percent        Cum.
------------------+-----------------------------------
  Red River Delta |       1175       19.59       19.59
        Northwest |        731       12.19       31.77
        Northeast |        128        2.13       33.91
        N.C Coast |        708       11.80       45.71
       S.C. Coast |        628       10.47       56.18
Central Highlands |        276        4.60       60.78
        Southeast |       1241       20.69       81.46
     Mekong Delta |       1112       18.54      100.00
------------------+-----------------------------------
            Total |       5999      100.00

. lab list reglbl
reglbl:
           1 Red River Delta
           2 Northwest
           3 Northeast
           4 N.C Coast
           5 S.C. Coast
           6 Central Highlands
           7 Southeast
           8 Mekong Delta

#delimit

In Example 11, you may have noticed that the region labels were too long to fit on one line. This is inconvenient when you are writing the command because, whether you are in the Do-file Editor or the Stata Command window, you have to scroll over to read the end of the command. The #delimit command solves this problem by allowing you to change the symbol used to indicate the end of the command. The default is a hard-return, called "cr" by Stata. The alternative is the semi-colon.

#delimit ;    makes the semi-colon the indicator of the end of the command
#delimit cr   makes the hard-return the indicator of the end of the command



Some facts about #delimit:

• It can only be used in a Do-file. It does not work in the Stata Command window.
• The semi-colon is useful if you have long commands.
• The hard-return is more convenient if you have short commands.

For example, the regional labels could be entered like this:

label var reg7 "Region"
#delimit ;
label def reglb 1 "North Uplands"
                2 "Red River Delta"
                3 "NC Coast"
                4 "SC Coast"
                5 "Central Highlands"
                6 "Southeast"
                7 "Mekong Delta" ;
#delimit cr
lab val reg7 reglb

An alternative way of dealing with long lines is:

label def reglb 1 "North Uplands"      /*
             */ 2 "Red River Delta"    /*
             */ 3 "NC Coast"           /*
             */ 4 "SC Coast"           /*
             */ 5 "Central Highlands"  /*
             */ 6 "Southeast"          /*
             */ 7 "Mekong Delta"

The #delimit command and the /* ... */ symbols can be used with any command, but they are often used with value labels.

tabulate ... summarize

This command creates one- and two-way tables that summarize continuous variables. The command tabulate by itself gives frequencies and percentages in each cell (cross-tabulations). With the "summarize" option, we can put means and other statistics of a continuous variable in each cell. The syntax is:

tabulate varname1 varname2 [if exp] [in range], summarize(varname3) options

where
varname1   is a categorical row variable
varname2   is a categorical column variable (optional)
varname3   is the continuous variable summarized in each cell
options    can be used to tell Stata which statistics you want

Some notes regarding this command:

• The default statistics are the mean, the standard deviation, and the frequency.
• You can specify which statistics with the options "means", "standard", and "freq".
• You can use the abbreviation "tab ... sum( )".
• This command is similar to the Stata command "by var1: sum var3" except that the "tab ... sum" output is more attractive and "tab ... sum" allows two categorical variables.
• This command is also similar to the SPSS command "means var3 by var1".



Some examples:

tab reg8, sum(rlpcex2)          gives the mean, std deviation, and frequency of per capita expenditure for each region
tab urban98, sum(hhsize) mean   gives the mean household size for urban and rural households
tab farm urban98, sum(food)     gives the mean, std deviation, and frequency in each cell of a 2x2 table of farmers/nonfarmers and urban/rural

In Example 12, we give the output for three "tab ... sum" commands.

• The first table is a one-way table (just one categorical variable) showing the mean, standard deviation, and frequency of per capita expenditure for each expenditure quintile.
• In the second table, we use the "mean" option so only mean per capita expenditure is shown.
• In the third table, we add a second categorical variable (urban98), making it a two-way table. Although we could have requested all the default statistics in the two-way table, this makes the table difficult to read, so we do not advise it.

Example 12: Using tab ... sum

. tab quint98, sum(rlpcex2)

            | Summary of Expenditure per capita
  exp quint |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   1180.4918   256.92477         917
          2 |   1738.2957   163.67476        1012
          3 |   2248.0415   209.72735        1158
          4 |   3080.1508   357.28157        1316
          5 |   6571.6628   3597.4445        1596
------------+------------------------------------
      Total |   3331.6804   2768.0741        5999

. tab quint98, sum(rlpcex2) mean

            |  Summary of
            | Expenditure
            |  per capita
      quint |        Mean
------------+------------
          1 |   1180.4918
          2 |   1738.2957
          3 |   2248.0415
          4 |   3080.1508
          5 |   6571.6628
------------+------------
      Total |   3331.6804

. tab quint98 urban98, sum(rlpcex2) mean

        Means of B.M&Reg price adj. pc exp

           | 1:urban 98; 0:rural
           |          98
     quint |     Rural      Urban |     Total
-----------+----------------------+----------
         1 | 1175.8359  1279.9701 | 1180.4918
         2 | 1738.5241  1735.8682 | 1738.2957
         3 | 2242.8277  2278.9809 | 2248.0415
         4 | 3056.2124   3138.038 | 3080.1508
         5 | 5260.4178  7253.5102 | 6571.6628
-----------+----------------------+----------
     Total | 2477.9412  5438.3927 | 3331.6804
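For readers without the VLSS files at hand, the same one-way and two-way "tab ... sum" tables can be reproduced on a tiny made-up data set. This is only a sketch: the variables region, urban, and pcexp and their values are assumptions invented for illustration.

```stata
* Illustrative sketch: region, urban, and pcexp are made-up variables
clear
input region urban pcexp
1 0 1200
1 0 1400
1 1 2600
2 0 1000
2 1 2000
2 1 2400
end

* One-way table: mean, standard deviation, and frequency of pcexp by region
tabulate region, summarize(pcexp)

* Two-way table showing only the mean in each region-by-urban cell
tabulate region urban, summarize(pcexp) means
```

With only six records it is easy to check each cell by hand, which is a useful habit before trusting a table built from thousands of households.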



tabstat

This command gives summary statistics for a set of continuous variables for each value of a categorical variable. The syntax is:

tabstat varlist [if exp] [in range] , stat(statname [...]) by(varname)

where
varlist    is a list of continuous variables
statname   is a type of statistic
varname    is a categorical variable

Some facts about this command:

• The default statistic is the mean.
• Optional statistics include mean, sum, max, min, range, sd (standard deviation), var (variance), skewness, kurtosis, median, and pn (nth percentile, for example p25).
• Without the by() option, tabstat is like "summarize" except that it allows you to specify the list of statistics to be displayed.
• With the by() option, tabstat is like "tabulate ... summarize" except that tabstat is more flexible in the statistics and format.
• It is very similar to the SPSS command "means".

Examples:

tabstat farmsize hhsize, stats(mean max min)   gives mean, max, and min of farmsize and hhsize
tabstat farmsize hhsize, by(reg8)              gives mean of two variables for each region
tabstat farmsize, stats(median) by(reg8)       gives the median farmsize for each region

Example 13. Using tabstat to create tables

. tabstat rlpcex2, stats(p25 p50 p75 mean) by(reg8)

Summary for variables: rlpcex2
     by categories of: reg8 (Region)

            reg8 |       p25       p50       p75      mean
-----------------+----------------------------------------
 Red River Delta |  1874.237   2583.79  3919.563  3392.663
       Northwest |  1460.617  1973.746  2963.513  2361.977
       Northeast |   1234.18  1633.964   2143.89  1831.876
       N.C Coast |  1568.988   2102.81  2906.817  2604.028
      S.C. Coast |  1813.223  2485.993   3610.65  3098.375
Central Highland |  1158.759   1818.21  2784.002  2114.033
       Southeast |  2553.454  3851.385  6213.039  5034.884
    Mekong Delta |  1795.612  2474.845  3586.993  3073.823
-----------------+----------------------------------------
           Total |  1772.714  2523.891   3866.33   3331.68
----------------------------------------------------------
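The three tabstat examples listed earlier can be run as they stand on a small made-up data set. The sketch below is illustrative: region, farmsize, and hhsize and their values are assumptions for the example, not VLSS data.

```stata
* Illustrative sketch: region, farmsize, and hhsize are made-up variables
clear
input region farmsize hhsize
1 0.5 4
1 1.5 6
1 1.0 5
2 2.0 3
2 4.0 7
end

* Several statistics for two variables, whole sample
tabstat farmsize hhsize, stats(mean max min)

* Mean (the default statistic) of both variables for each region
tabstat farmsize hhsize, by(region)

* Median farm size for each region
tabstat farmsize, stats(median) by(region)
```

The first command has no by() option, so it behaves like summarize with a chosen list of statistics; the other two add one row per region plus a Total row.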



table

This command creates a wide variety of tables. It is probably the most flexible and useful of all the table commands in Stata. The syntax is:

table rowvar colvar [if exp] [in range], c(clist) [row col]

where
rowvar   is the categorical row variable
colvar   is the categorical column variable
clist    is a list of statistics and variables
row      is an option to include a summary row
col      is an option to include a summary column

Some useful facts about this command:

• The default statistic is the frequency.
• Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min, median, and pn (nth percentile).
• The c( ) is short for contents of each cell.
• Like tab, it can be used to create one- and two-way frequency tables, but table cannot do percentages.
• Like tab ... sum, it can be used to calculate basic statistics for each value of a categorical variable.
• Its advantage over tab ... sum is that it can do more statistics and it can take more than one continuous variable.
• Like tabstat, it can be used to calculate advanced statistics for each value of a categorical variable.
• Its advantage over tabstat is that it can do two-way (and higher) tables, but its disadvantage is that it has fewer statistics.
• It is similar to "table" in SPSS, but easier to learn and less flexible in formatting.

Here are some examples:

. table reg8, row                                     table of frequencies by region with total row
. table reg8, c(mean income)                          table of average income by region
. table reg8, c(mean yield sd yield median yield)     table of yield statistics by region
. table reg8, c(mean yield) format(%9.2f)             table of average yields by region with format
. table reg8 sex, c(mean yield)                       table of average yield by region and sex
. table reg8 sex, c(mean income mean yield)           table of avg yield & income by region & sex

Some output from table commands is shown in Example 14. The first table is a two-way table of average household size by region and urban/rural. The second table is the same except that the format option has been added to reduce the size of the numbers. The option format(%4.1f) means fixed format with a total width of 4 characters and one digit to the right of the decimal point. The fourth table gives the average per capita expenditure for urban and rural households in each region (the sample did not include any urban areas in the Central Highlands). It uses format(%6.0f), which expresses expenditure as an integer. Also note that it has a summary column, but no summary row. Usually, in a two-way table, it is useful to have both row and column summaries.



weights

What are sampling weights?

Sampling weights are used to compensate for under- or over-representing certain households in a sample and to allow extrapolation of the sample results to the population. Let's take a simple example:

• Suppose you wanted to estimate the total population of Hanoi by interviewing a random 25% of the households. In your sample, there are h households and H people. Your estimate of the total population of Hanoi would be 4*H.

• Similarly, if you interview 10% of the households in Da Nang and find d households and D people, your estimate of the population of Da Nang would be 10*D.

• If you want an estimate of the population of the two cities together, you would calculate 4*H+10*D.

• If you wanted to estimate the average household size of the two cities, you would have to divide the estimated total population (4*H+10*D) by the estimated total number of households (4*h+10*d).
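The arithmetic in this example can be checked in Stata with a few made-up records. The sketch below assumes a sample of two households in each city (so h = 2 and d = 2), with household sizes chosen arbitrarily; the weight wt is 4 for Hanoi (a 25% sample) and 10 for Da Nang (a 10% sample). All names and numbers here are illustrative assumptions.

```stata
* Illustrative sketch: two sampled households per city, made-up sizes
clear
input str8 city hhsize wt
"Hanoi"   4  4
"Hanoi"   6  4
"DaNang"  3 10
"DaNang"  5 10
end

* Estimated total population = 4*H + 10*D
generate wpeople = wt*hhsize
quietly summarize wpeople
scalar totpop = r(sum)

* Estimated total number of households = 4*h + 10*d
quietly summarize wt
scalar tothh = r(sum)

* Weighted average household size = (4*H + 10*D)/(4*h + 10*d)
display "Estimated population:  " totpop
display "Estimated households:  " tothh
display "Weighted mean hh size: " totpop/tothh

* Stata gives the same weighted mean when the weight is supplied directly
summarize hhsize [fw=wt]
```

Here H = 10 and D = 8, so the estimated population is 4*10 + 10*8 = 120 and the estimated number of households is 4*2 + 10*2 = 28. The unweighted mean of the four records would be 4.5, while the weighted mean is 120/28, because the Da Nang households each stand for more households in the population.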

The basic principle is that the sampling weight is the inverse of the probability of selection. Because of clustering and sampling, virtually all random-sample surveys must use weights to make estimates that are valid for the whole population. Furthermore, the calculation of sums, averages, and percentages must take into account the sampling weights.

Sampling weights in the VLSS

The calculation of the sampling weights in the VLSS is much more complicated than the example given above, but the principle is the same. The GSO estimated the probability that each household would be selected and then calculated the sampling weight as the inverse of that probability. In the VLSS, the sampling weight is in hhexp98n.dta and the variable name is "wt" [2]. We can use the table command to generate some statistics about the VLSS weights.

. table reg8 urban98, c(mean wt) row col format(%7.0f)

---------------------------------------
                  | 1:urban 98; 0:rural
                  |          98
           Region | Rural  Urban  Total
------------------+--------------------
  Red River Delta |  3482   2439   3134
        Northwest |  3441   2330   3206
        Northeast |  3692   2087   3291
        N.C Coast |  3434   1803   3185
       S.C. Coast |  2281   1841   2127
Central Highlands |  1320          1320
        Southeast |  1806   2208   1981
     Mekong Delta |  3093   2482   2938
                  |
            Total |  2869   2242   2689
---------------------------------------

The average weight is 2689, meaning that each household in the VLSS sample represents (on average) 2689 households in Vietnam. The new 2001 Vietnam Household Living Standards Survey will have a sample of about 75,000, so the weights will be much smaller, probably around 230.

[2] This weight is used for calculating averages in which every household has equal weight. Sometimes, we want to give each person an equal weight, such as when we want to calculate the percentage of people that are in households below the poverty line. For these calculations, it is better to use the variable wthhsize as a weight. This variable is simply wt*hhsize.



Using sampling weights in Stata

The calculation of weighted sums and weighted averages would be very tedious, but fortunately statistical software such as SPSS and Stata does this for us. In SPSS, you turn on the weights and they are used in all calculations until you turn them off. Stata is different in that you tell Stata which commands should use weights. Stata allows four kinds of weights:

1) fweights, or frequency weights, are weights that indicate the number of duplicated observations.

2) pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included due to the sampling design.

3) aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation.

4) iweights, or importance weights, are weights that indicate the "importance" of the observation in some vague sense.

Here we will focus on pweights and fweights [3]. The syntax for using weights is:

command ... [weighttype=varname] ...

In the case of the VLSS, we will generally be using the following syntax:

command ... [pw=wt] ...

Here are some examples:

tab reg8 [fw=wt]                       gives the weighted frequencies in each region
sum hhsize [fw=wt]                     gives the weighted mean household size
tab sex [fw=wt], sum(rlpcex2)          gives table of weighted mean expenditure by sex of head of household
tabstat hhsize [fw=wt], by(urban98)    gives the weighted average household size for urban and rural households
table reg8 [pw=wt], c(mean age)        gives the weighted mean age of heads by region

Example 15 shows the effect of weights. The first table gives the unweighted percentage of urban and rural households in each department. In the second table, the weights are turned on. Notice that the urban households represent almost 29 percent of the sample but just 24 percent of the population. This means that urban households were slightly over-represented in the original VLSS sample (you can verify in the table above that urban weights are slightly smaller). This also means that using the raw, unweighted results would give too much weight to urban households relative to their share of the population. The box also shows that weighted and unweighted means are different. The average household size is 4.75 without the weights and 4.70 with the weights. Notice that the number of observations in the second is about 16 million (roughly 5999 households times the average weight of 2689). This represents the extrapolated number of households. Type "help weights" in the Stata Command window for more information.

[3] For a number of commands, like tab, sum, and tabstat, Stata does not allow pweight, but fweight gives the correct percentages and means.



SECTION 7: PRESENTING DATA WITH GRAPHS

This section provides a brief introduction to creating graphs. In Stata, all graphs are made with the graph command, but there are 8 types of charts and numerous subcommands for controlling the type and format of graph. In this section, we focus on four types of graph and a few options. These are the subcommands covered in this section:

graph
histogram
twoway
bar
pie
matrix
xlabel
ylabel
connect( )
symbol( )

graph

This command generates numerous types of graphs and diagrams. The syntax is:

graph [varlist] [if exp] [in range] , graphtype options

where
varlist     is the list of variables to graph
graphtype   is the type of graph
options     are commands to control the look of the graph

The eight graph types are:

histogram   Bar chart based on frequency
oneway      Scatterplot with one variable
twoway      Scatterplot with two variables
matrix      Matrix of two-way scatterplot graphs
box         Box-and-whisker plot
star        Star chart
bar         Bar chart of means or sums
pie         Pie chart

There are too many options to describe here, but we describe how to make some of the more common graphs. The default graph type depends on the number of variables specified:

• The default graph type is histogram if only one variable is specified.
• The default graph type is two-way scatterplot if two or more variables are specified.

Some options are common to many graph types:

title("text")   specifies the title to use on the graph
b2("text")      specifies title on X axis (b for bottom)
l2("text")      specifies title on Y axis (l for left)
xlabel          uses "round" values to label x axis
ylabel          uses "round" values to label y axis
by(var1)        repeats graph for each value of var1

Some options for histograms:

bin(#)     specifies that the histogram will have # bars
freq       labels Y axis in terms of frequency
percent    labels Y axis in terms of percent
normal     draws a normal curve with the mean and SD of the variable

Some options for two-way scatterplots:

connect( )   to specify how points are connected
symbol( )    to specify what the markers look like

Some options for bar charts:

means   graphs means of variables given
stack   stacks the bars for each variable rather than putting them side by side

Here are some examples of the graph command:

graph x                           histogram of x
graph x, bin(5) xlabel ylabel     histogram with 5 bars and rounded axis labels
graph y1 y2 x                     scatter plot of y1 and y2 against x
graph y x, by(region)             scatter plots of y against x for each region
graph a b c, bar                  graph sums of a, b, and c as bars
graph a b c, bar means            graph means of a, b, and c as bars

Example 16 shows the result of the command graph ricexpd hhsize, xlabel ylabel. It was inserted into Word by clicking Edit/Copy Graph in Stata and then Control-V in Word.

Example 16. Two-way scatterplot graph



In Example 17, a histogram was created with the command:

graph rlpcex2 if rlpcex2<20000, xlabel ylabel normal bin(20)

Example 17. Histogram of per capita expenditure in Vietnam

In Example 18, the data were sorted by reg8, then the graph was created with:

graph rlpcex2, bar means by(reg8) ylabel

Example 18. Bar chart of per capita expenditure by region



SECTION 8: MODIFYING DATA FILES

This section describes a number of commands that are used to modify and combine data files in Stata. We begin with five simple commands and then move to five more complex ones.

rename
drop
keep
sort
compress
collapse
merge
append
reshape
fillin

rename

This command renames variables. Some examples:

rename oldname newname
rename s1aq06y age

drop

This command deletes records or variables. Examples are:

drop if age>140      deletes records in which age is greater than 140
drop if area==.      deletes records in which area is missing
drop temp1 temp2     deletes variables temp1 and temp2

keep

This command deletes everything but specified observations or variables. Examples include:

keep if age <= 140          keeps only records in which age is 140 or under
keep househol age rlpcex2   keeps only variables househol, age, and rlpcex2, deleting others

sort

This command sorts the records in the file according to the value of specified variables. Examples are:

sort reg8 househol   sorts data file by reg8 and within each region by househol ID
sort urban98         sorts by the dummy variable urban98

compress

This command reduces the size of the file by changing the data storage types. It will not make any changes that would cause Stata to lose data. This command has no options or arguments.



collapse

This command is used to create a new data file by aggregating the existing one. It allows you to change the level of the data file. Person-level data can be collapsed to the household level to calculate the size of the household. Crop-level data can be collapsed to the household level to calculate the value of agricultural production per household. The syntax is:

collapse (stat) varlist1 (stat) varlist2, by(varlist3)

where
stat       refers to one of the statistics
varlist1   are the variables to be aggregated using the first statistic
varlist2   are the variables to be aggregated using the second statistic
varlist3   are the categorical variables which define the aggregation

Some points about the collapse command:

• The default statistic is the mean.
• Optional statistics are mean, sum, rawsum, count, max, min, median, and pn (the nth percentile, where n is between 1 and 100).
• The output file will have one record for each value of varlist3 in the by( ) option.
• If no by( ) option is given, then the data will be collapsed to one record.
• This is similar to "aggregate" in SPSS except Stata does not require you to define a new name for the aggregated variable (by default, it uses the old variable name).

Examples of the collapse command:

collapse age educ income, by(province) creates a dataset of provincial means of age, education, and income

collapse (median) income, by(province) creates a dataset of provincial medians of income

collapse (mean) age (median) income, by(reg8) creates a dataset of regional means of age and regional medians of income

collapse (mean) age educ (median) income creates a dataset with overall means & medians
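A small made-up person-level file shows how collapse changes the level of the data. In this sketch the variables hhid, person, and age are illustrative assumptions, and the new names hhsize and meanage are chosen with the newname=oldname form so that the aggregated variables do not keep the old names.

```stata
* Illustrative sketch: six people in three made-up households
clear
input hhid person age
1 1 40
1 2 38
1 3 10
2 1 65
2 2 60
3 1 30
end

* One record per household: number of members and their mean age
collapse (count) hhsize=person (mean) meanage=age, by(hhid)

* The file now has three records, one for each value of hhid
list
```

The person-level records are gone after collapse, so it is wise to save the original file (or work on a copy) before running it.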

Example 19. Using collapse to calculate household size

. use scr01a2
. sum s1aq05y

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     s1aq05y |   28069    1969.846   19.95489       1899       1998

. collapse (count) idcode if s1aq11==1, by(househol)
. gen hhsize = idcode
. sum hhsize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize |    6002    4.751583    1.95443          1         19



In Example 19, we use collapse to calculate the average household size from the person-level data. The first sum command shows that there are 28,069 records in the original person-level file. After the collapse, the second sum command indicates that there are just 6002 records. It also shows that the average household size (unweighted) is 4.75, the same figure we found in the hhexp98n file in Section 3.

merge

This command combines two files with different variables into one file. Until now, all the commands we have worked with used just one file. The VLSS has over one hundred files, however, and often we would like to combine data from different files. For example,

• to calculate expenditure we need to combine the files for food expenditure and non-food expenditure

• to calculate school attendance rates, we need to combine the file with age and the file with school attendance

• to examine the relationship between the value of the house and housing characteristics, we need to combine several files.

• to calculate the value of agricultural production, we need to combine the files for rice, other food crops, annual industrial crops, and permanent industrial crops.

Files can be combined vertically (top to bottom). In this case, the two files have different records and are linked by having the same variables. The files below have different records but the same variables. The first file has crops 1-10, while the second file has crops 11-20. They can be combined with append as described later.

Two files before append               One file after append

hhid crop area quant value            hhid crop area quant value
101    1                              101    1
101    4                              101    4
102    1                              102    1
102    7                              102    7
103    2                              103    2
                              ══►     101   16
hhid crop area quant value            102   12
101   16                              102   13
102   12                              103   11
102   13                              103   16
103   11                              103   19
103   16
103   19

Files can be combined horizontally (side to side). In this case, the two files have different variables and are linked by having the same observations (person, household, crop, etc.). The files below have different variables but the same records (household). The command merge will combine records with the same household identification number (hhid). This would allow an analysis of how housing value (in the second file) varies according to expenditure quintile (in the first).



Two files before merge

hhid region urban exppc quint farm      hhid housetype water elect value
101                                     101
102                                     102
103                                     103
201                                     201
202                                     202
203                                     204
204                                     204

                    ║
                    ║
                    ▼

One file after merge

hhid region urban exppc quint farm housetype water elect value
101
102
103
201
202
203
204

The syntax for the merge command is:

merge [varlist] using filename

where
varlist    is the list of variables in common
filename   is the data file that the current data set will be merged with

Some notes about the merge command:

• Both the original file and the new file must be sorted by the common variable(s) before merging.
• A variable called _merge is created which indicates the source of each record:
     _merge=1 means it is from the original data set only
     _merge=2 means it is from the new data set only
     _merge=3 means it is from both data sets.
• It is a good idea to run a "tab _merge" command after every merge to check the merger.
• The merge command in Stata is similar to the "match files" command in SPSS.
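The notes above can be seen at work in a small self-contained sketch. Everything in it is made up for illustration: the file name housing_demo and the variables quint and housetype are assumptions, and the file is written to the current working directory. The sketch uses the merge syntax described in this manual; Stata 11 and later document a newer form (for example, merge 1:1 hhid using filename), so treat the exact syntax as dependent on your Stata version.

```stata
* Illustrative sketch: made-up housing file, sorted and saved first
clear
input hhid housetype
101 1
102 2
204 1
end
sort hhid
save housing_demo, replace

* Made-up expenditure file, also sorted by the common variable
clear
input hhid quint
101 1
102 3
203 5
end
sort hhid

* Combine the two files side by side, matching on hhid
merge hhid using housing_demo

* Check the source of each record
tab _merge
list
```

Households 101 and 102 appear in both files (_merge=3), household 203 is only in the original file (_merge=1), and household 204 is only in the new file (_merge=2), so the merged file has four records with missing values where a household was absent from one file.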

Some examples:

use members                    opens file "members"
merge hhid perid using educ    merges file "members" with "educ" with hhid and perid as the common variables
use hhchar                     opens file "hhchar"
merge hhid using housing       merges "hhchar" and "housing" using hhid as the common variable

In Example 20, we merge the list of household members (scr01a2) with the education file (scr02a). We open the household member file, rename some variables, delete others, and then sort. Next, we merge the member file and the education file. After renaming and dropping more variables, the des command shows that we have age and sex from the first file and attend from the second.



Example 20. Using merge to calculate school attendance

. use "D:\Vietnam Pov Mapping\Training\SCR01A2.DTA", clear
. rename s1aq02 sex
. rename s1aq06y age
. keep househol idcode sex age
. des

Contains data from D:\Vietnam Pov Mapping\Training\SCR01A2.DTA
  obs:        28,633
 vars:             4                          16 Dec 1999 15:00
 size:       343,596 (99.7% of memory free)
-----------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
househol        long   %12.0g                 HOGIADINH
idcode          byte   %8.0g                  MA HIEU:
sex             byte   %8.0g                  2. Gioi tinh :
age             int    %8.0g                  6. [TEN] bao nhieu..?SO NAM:
-----------------------------------------------------------------------------
Sorted by:  househol
     Note:  dataset has changed since last saved

. sort househol idcode
. merge househol idcode using scr02a
. gen attend=(s2aq03==1 | s2aq03==3)*100
. drop cluster-s2aq22
. des

Contains data from D:\Vietnam Pov Mapping\Training\SCR01A2.DTA
  obs:        28,633
 vars:             6                          16 Dec 1999 15:00
 size:       486,761 (99.5% of memory free)
-----------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
househol        long   %12.0g                 HOGIADINH
idcode          byte   %8.0g                  MA HIEU:
sex             byte   %8.0g                  2. Gioi tinh :
age             int    %8.0g                  6. [TEN] bao nhieu..?SO NAM:
_merge          byte   %8.0g
attend          float  %9.0g
-----------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. sort age
. graph attend if age<25, bar means ylabel by(age)



The graph in Example 21 shows the percentage attending school for each age from 0 to 24.

Example 21. Graph of school attendance by age

append

This command combines two files with different records but the same variables. The syntax is:

append using filename

where filename is the name of the file to be added to the current data set. This command is similar to "join files" in SPSS.

In the VLSS, the append command is useful in analyzing household expenditure and agricultural production. For example, the agricultural production data is found in six files:

scr09b1   rice production
scr09b2   other food production
scr09b3   annual industrial crops
scr09b4   permanent industrial crops
scr09b5   fruit crops
scr09b6   agro-forestry crops

In order to calculate the value of agricultural production, crop sales, or total income, it is necessary to combine these files. Since they have similar variables but refer to different observations (crops), we combine them with append. We will illustrate the method by combining the rice and other food files.



Because the variables are not quite the same, we need to rename the variables before combining the files. This will require more than 10 commands, so it is probably worth creating a Do-file by clicking on Window/Do-file Editor. In the file, we type the following commands:

use "D:\Vietnam Pov Mapping\Training\SCR09B1.DTA", clear
rename s9b1cc crop
rename s9b1q03 area
rename s9b1q04 prod
rename s9b1q061 saleq
replace saleq = saleq*(1/.67) if s9b1q062==2
rename s9b1q071 buyer
keep househol crop area prod saleq buyer
des, short
save riceprod

use "D:\Vietnam Pov Mapping\Training\SCR09B2.DTA", clear
rename s9b2cc crop
rename s9b2q03 area
rename s9b2q04 prod
rename s9b2q06 saleq
rename s9b2q071 buyer
keep househol crop area prod saleq buyer
des, short
save foodprod

use riceprod, clear
append using foodprod
save allfood
des, short
table crop, c(mean area mean prod mean saleq) format(%6.0f)

Example 22 shows selected results from the Stata Results window. The rice file (after modification) contained 8,720 records and 6 variables. The other food file (after modification) contained 10,541 records and 6 variables. The combined file has 19,261 records (8,720 + 10,541) and 6 variables.

Example 22. Using append to combine files

Contains data from D:\Vietnam Pov Mapping\Training\SCR09B1.DTA
  obs:         8,720
 vars:             6                          11 Jul 1999 16:13
 size:       261,600 (99.7% of memory free)
Sorted by:  househol  crop
     Note:  dataset has changed since last saved

Contains data from D:\Vietnam Pov Mapping\Training\SCR09B2.DTA
  obs:        10,541
 vars:             6                          11 Jul 1999 15:45
 size:       316,230 (99.7% of memory free)
Sorted by:  househol
     Note:  dataset has changed since last saved

Contains data from allfood.dta
  obs:        19,261
 vars:             6                          4 Aug 2002 02:53
 size:       654,874 (99.4% of memory free)
Sorted by:
     Note:  dataset has changed since last saved
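The same append logic can be tried on a small scale without the VLSS files. The sketch below uses two made-up files with identical variable names (the values are hypothetical). It also assumes a recent version of Stata, where append accepts a generate() option that records which file each record came from; that option is not part of the do-file above.

```stata
* Sketch only: two small hypothetical crop files with the same variables
clear
input househol crop area prod
1 1 3 3
2 1 1 1
end
tempfile rice
save `rice'

clear
input househol crop area prod
1 5 1 1
3 4 1 1
3 5 4 4
end
tempfile food
save `food'

use `rice', clear
* source = 0 for records already in memory, 1 for records added by append
append using `food', generate(source)
describe, short
tabulate source

* mean area and production by crop in the combined file
tabstat area prod, by(crop) statistics(mean)
```

As in Example 22, the number of records in the combined file is the sum of the records in the two files (here 2 + 3 = 5).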



fillin

This command inserts additional records into a file so that all combinations of two or more variables are in the file. Again, it is easier to give an example than to describe it. Suppose we are working with crop production data. Data are collected on 5 crops, but most households only grow 2-4 of them. The records exist only for crops grown by the household, as shown below:

File in original form

hhid  crop  area  prod
   1     1     3     3
   1     3     1     1
   1     5     1     1
   2     1     1     1
   2     5     1     1
   3     1     2     2
   3     4     1     1
   3     5     4     4

If we calculate the average area for each crop, it will give the average area among those growing the crop. If we want the average area including the non-growers, it is not easy to calculate. Stata allows you to fill in the "missing" records of crops not grown by each household. The syntax is easy:

    fillin varlist

where varlist is the list of variables, every combination of which we want to exist in the file. Using our example above, the command would be

    fillin hhid crop

Stata will look for all the values of hhid and all the values of crop in the file, then it will make sure every hhid-crop combination has a record. When it has to insert a record, the values of the other variables will be missing. The new file would look like this:

File after fillin command

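hhid  crop  area  prod
   1     1     3     3
   1     3     1     1
   1     4     .     .
   1     5     1     1
   2     1     1     1
   2     3     .     .
   2     4     .     .
   2     5     1     1
   3     1     2     2
   3     3     .     .
   3     4     1     1
   3     5     4     4

Note that no records are created for crop 2. No household in this example grows it, so the value 2 never appears in the file, and fillin only combines values that are already present.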
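The fillin example can be reproduced directly by typing the eight original records into Stata. The sketch below is a minimal illustration: it compares the average area among growers with the average over all households once the inserted records are set to zero. The use of replace with missing() and of tabstat are choices made for this sketch; the _fillin variable is created by fillin itself to mark the records it added.

```stata
* The eight records of the original file
clear
input hhid crop area prod
1 1 3 3
1 3 1 1
1 5 1 1
2 1 1 1
2 5 1 1
3 1 2 2
3 4 1 1
3 5 4 4
end

* average area among growers only
tabstat area, by(crop) statistics(mean n)

* add the hhid-crop combinations that are absent
* (crop 2 never appears in the data, so no records are created for it)
fillin hhid crop

* _fillin equals 1 for the records that fillin inserted
count if _fillin == 1

* treat crops not grown as zero area and zero production
replace area = 0 if missing(area)
replace prod = 0 if missing(prod)

* average area over all three households
tabstat area, by(crop) statistics(mean n)
```

For crops 1 and 5, which every household grows, the two averages are the same. For crops 3 and 4, which only one household grows, the average falls from 1 to one-third once the two non-growers are counted as zeros.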



If we calculate the average area and production on this file, we will get the same answer as above. Because missing values are not counted, the result will be the average among growers. But if we replace the missing values with zeros:

    recode area .=0
    recode prod .=0

then the averages will include the zeros. This is an extremely useful command, particularly for dealing with crop data and expenditure data. SPSS does not have a similar command.

reshape

This command changes a file from tall to wide or from wide to tall. What do we mean by "wide" and "tall"? A wide file stores additional information as separate variables, while a tall file stores this information using additional records. An example will be easier to understand. Suppose a household credit survey asks about the amount and source of the three most recent loans. One way to store this data is with a wide file, in which additional loans are stored in additional variables.

File in "wide" format

hhid  amount1  source1  amount2  source2  amount3  source3
   1
   2
   3
   4
   5

The other way to store the data is with a tall file, in which additional loans are stored as additional records.

File in "tall" format

hhid  loannbr  amount  source
   1        1
   1        2
   1        3
   2        1
   2        2
   2        3
   3        1
   3        2
   3        3
   4        1
   4        2
   4        3
   5        1
   5        2
   5        3

Notice that both files have the same number of data points (30) for loan amount and source of the loan; they are just arranged differently. The reshape command allows you to convert one type of file into the other. For information on how to implement reshape, type "help reshape" in the Stata Command window.
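To give an idea of what the conversion looks like in practice, the sketch below builds a small wide file with made-up loan data (two households, hypothetical amounts and source codes) and converts it to tall and back. The variable names follow the tables above; the values are assumptions for illustration only.

```stata
* Sketch only: hypothetical loan data in wide format
clear
input hhid amount1 source1 amount2 source2 amount3 source3
1 100 1 50 2 . .
2 200 3  . . . .
end

* wide to tall: one record per household and loan number
* i() names the household identifier, j() names the new loan counter
reshape long amount source, i(hhid) j(loannbr)
list, sepby(hhid)

* tall back to wide: one record per household again
reshape wide amount source, i(hhid) j(loannbr)
describe, short
```

After reshape long, the file has six records (2 households x 3 loans) and the variables hhid, loannbr, amount, and source. After reshape wide, it has two records again.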



SECTION 9: REGRESSION ANALYSIS

This section describes the use of Stata to do regression analysis. Regression analysis involves estimating an equation that best describes the data. One variable is considered the dependent variable, while the others are considered independent (or explanatory) variables. Stata is capable of many types of regression analysis and associated statistical tests. In this section, we touch on only a few of the more common commands and procedures. The commands described in this section are:

regress
test, testparm
predict
probit
ovtest
hettest

regress

This command carries out a regression analysis on the variables specified. The syntax is:

    regress depvar varlist [if exp] [in range] [, options]

where

depvar     is the dependent variable
varlist    is the list of independent variables

The regress command has many options for specifying the type and format of the output. Type "help regress" for more information. Some examples of the command:

. regress y x1 x2 x3 x4 x5                  regress y with the x's as independent variables
. regress y x1 x2 x3 x4 x5 if region==1     same regression but only in one region
. by region: regress y x1 x2 region*        region* means all variables starting with "region"

predict

This command can be used to obtain predictions, residuals, etc., after regression analysis.

    predict newvarname [if exp] [in range] [, options]

Two of the most common options are:

xb       predicted values of y are put in newvarname
resid    residuals of the regression are put in newvarname

For example:

. regress y x1 x2 x3
. predict yhat, xb                  creates variable yhat with predicted values
. predict e, resid                  creates variable e with residuals
. probit poverty age sex housing
. predict index, xb                 creates variable index with the value of the sum of XB
. predict phat                      creates variable phat with the predicted probability
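The predict examples above use placeholder variable names. A version that can be run as it stands is sketched below. It assumes the auto data set that is installed with Stata (loaded with sysuse), which is not part of the VLSS; the choice of variables is arbitrary and only meant to show the mechanics.

```stata
* Sketch only: uses the auto data shipped with Stata, not the VLSS
sysuse auto, clear

regress price mpg weight foreign
* predicted values and residuals from the regression
predict yhat, xb
predict ehat, residuals

* predicted values plus residuals reproduce the dependent variable
generate check = yhat + ehat
summarize price check

probit foreign mpg weight
* index is the sum of XB; phat is the predicted probability
predict index, xb
predict phat

* the predicted probability is the normal distribution function of the index
generate phat2 = normal(index)
summarize phat phat2
```

The two summarize commands should show matching means (apart from rounding), which confirms what each predict option stores.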

Example 23 presents the results of a regression analysis of the determinants of rice expenditure. The results indicate that rice expenditure is greater in larger households headed by older males.



Example 23. Using regress to examine determinants of rice expenditure

Rice expenditure is positively related to per capita expenditure (though interestingly, the coefficient is negative if the urban dummy variable is excluded). Urban households consume significantly less rice than rural households, even after controlling for other factors. Compared to the Central Highlands (region6), households in the Northeast and Red River Delta spend more on rice. Note that Stata automatically dropped one of the regional dummy variables to avoid perfect multicollinearity.

. use hhexp98n, clear
. tab reg8, gen(region)

  Code by 8 |
    regions |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1175       19.59       19.59
          2 |        731       12.19       31.77
          3 |        128        2.13       33.91
          4 |        708       11.80       45.71
          5 |        628       10.47       56.18
          6 |        276        4.60       60.78
          7 |       1241       20.69       81.46
          8 |       1112       18.54      100.00
------------+-----------------------------------
      Total |       5999      100.00

. gen age2 = age^2
. regress ricexpd hhsize age age2 sex rlpcex2 educyr98 urban98 region*

      Source |       SS       df       MS              Number of obs =    5999
-------------+------------------------------           F( 14,  5984) =  652.17
       Model |  4.7119e+09    14   336562752           Prob > F      =  0.0000
    Residual |  3.0881e+09  5984  516065.748           R-squared     =  0.6041
-------------+------------------------------           Adj R-squared =  0.6032
       Total |  7.8000e+09  5998  1300436.14           Root MSE      =  718.38

------------------------------------------------------------------------------
     ricexpd |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |   414.4978   5.358447    77.35   0.000     403.9933    425.0023
         age |   51.19595    4.82956    10.60   0.000     41.72827    60.66363
        age2 |  -.4741431   .0472076   -10.04   0.000    -.5666869   -.3815993
         sex |  -109.7768   22.93277    -4.79   0.000    -154.7333   -64.82034
     rlpcex2 |   .0187889   .0043095     4.36   0.000     .0103408     .027237
    educyr98 |  -6.659207   2.670341    -2.49   0.013    -11.89404   -1.424376
     urban98 |  -453.6721    24.4138   -18.58   0.000     -501.532   -405.8123
     region1 |   36.34716   49.75018     0.73   0.465    -61.18113    133.8754
     region2 |   312.9143   51.49933     6.08   0.000      211.957    413.8715
     region3 |   443.1319   77.56194     5.71   0.000     291.0825    595.1813
     region4 |  -21.75859   51.96463    -0.42   0.675     -123.628    80.11081
     region5 |  -147.9203   52.73708    -2.80   0.005     -251.304   -44.53664
     region6 |  (dropped)
     region7 |   37.15175   49.34054     0.75   0.451    -59.57349     133.877
     region8 |   55.16549   48.84751     1.13   0.259    -40.59324    150.9242
       _cons |   -777.223     123.29    -6.30   0.000    -1018.916     -535.53
------------------------------------------------------------------------------



test

This command tests linear hypotheses about the estimated parameters from the most recently estimated model. For example,

. regress y age female educ region1 region2 region3 region4
. test region1=region2                     test hypothesis that region1 coef = region2 coef
. test region4 = (region1+region2)/2       test hypothesis given by equation
. test educ=.1                             test hypothesis that educ coef = 0.1
. test region1 region2 region3 region4     test hypothesis that the four region coefs are zero

If you want to test the hypothesis that the coefficients on a set of related variables are all equal to zero, you can use the related testparm command:

. testparm region*                         test hypothesis that all region* coefs are zero

probit

This command carries out a probit regression analysis of the specified variables. The syntax is:

    probit depvar indepvars [if exp] [in range] [, options]

Probit analysis is used when the dependent variable is a categorical variable with only two values. An alternative is the dprobit command, which reports the derivative of the probability with respect to each independent variable instead of the coefficient. Examples include:

. probit y x1 x2 x3                        run a probit with y as dependent and the x's as independent variables
. probit y x1 x2 x3, robust                run a "robust" probit (weaker assumptions about the error)
. dprobit y x1 x2 x3 if reg8==1            run dprobit in one region only

ovtest

Regression analysis generates the best linear unbiased estimates of the "true" coefficients provided that some assumptions are satisfied. One assumption is that no relevant variables have been omitted from the equation (omitted variables end up in the error term, which must be uncorrelated with the explanatory variables). This command performs a Ramsey RESET test for omitted variables (misspecification). The syntax is:

    ovtest [, rhs]

This test amounts to estimating y = xb + zt + u and then testing t=0. If the rhs option is not specified, powers of the fitted values are used for z. Otherwise, powers of the independent variables are used. Examples of the test are:

. regress y x1 x2 x3
. ovtest                                   tests significance of powers of predicted y
. ovtest, rhs                              tests significance of powers of x1, x2, and x3

hettest

Another assumption behind regression analysis is that the variance of the error term is constant across the sample. When this assumption is violated, the problem is called heteroskedasticity. This command tests for heteroskedasticity.

    hettest [varlist]



This command tests t=0 in Var(e) = s^2 exp(zt). If varlist is not specified, the fitted values are used for z. If varlist is specified, the variables specified are used for z. This test is also known as the Breusch-Pagan test for heteroskedasticity. Examples are:

. regress y x1 x2 x3
. hettest                                  test whether variance is related to predicted y
. hettest x3                               test whether variance is related to x3

Example 24 gives the result of some tests related to the regression analysis shown earlier. The test command tests the hypothesis that the coefficients on both age variables are zero, finding that the probability is very low (reported as 0.0000), so we can reject this hypothesis. This is not surprising since each is statistically significant on its own. The testparm command tests the hypothesis that all the region coefficients are equal to zero (that region does not influence rice expenditure). The hypothesis is rejected, meaning that the regional coefficients are jointly significant. The ovtest rejects (at the 5 percent level) the hypothesis that there are no omitted variables, indicating that we need to improve the specification (prices would be a good start). And finally, hettest indicates that there is heteroskedasticity which needs to be dealt with.

Example 24. Regression tests

. test age age2

 ( 1)  age = 0.0
 ( 2)  age2 = 0.0

       F(  2,  5984) =   59.97
            Prob > F =    0.0000

. testparm region*

 ( 1)  region1 = 0.0
 ( 2)  region2 = 0.0
 ( 3)  region3 = 0.0
 ( 4)  region4 = 0.0
 ( 5)  region5 = 0.0
 ( 6)  region6 = 0.0
 ( 7)  region7 = 0.0
 ( 8)  region8 = 0.0
       Constraint 6 dropped

       F(  7,  5984) =   26.88
            Prob > F =    0.0000

. ovtest

Ramsey RESET test using powers of the fitted values of ricexpd
       Ho:  model has no omitted variables
                 F(3, 5981) =      3.18
                  Prob > F =      0.0230

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of ricexpd
       Ho:  Constant variance
            chi2(1)      =  1473.93
            Prob > chi2  =   0.0000
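The same sequence of tests can be tried on any regression. The sketch below is an illustration under stated assumptions rather than a repeat of Example 24: it uses the auto data set installed with Stata instead of the VLSS, with dummy variables named rcat1 to rcat5 created only for this example. It also assumes a newer version of Stata, in which ovtest and hettest are run as subcommands of estat.

```stata
* Sketch only: uses the auto data shipped with Stata, not the VLSS
sysuse auto, clear

* dummy variables rcat1-rcat5 for the five repair-record categories
tabulate rep78, generate(rcat)

* rcat1 is left out as the base category
regress price mpg weight rcat2 rcat3 rcat4 rcat5

* joint test that the coefficients on mpg and weight are both zero
test mpg weight

* test that two category coefficients are equal
test rcat2 = rcat3

* joint test that all included category coefficients are zero
testparm rcat2-rcat5

* Ramsey RESET test and heteroskedasticity tests in the newer syntax
estat ovtest
estat hettest
estat hettest weight
```

Each test reports a probability (Prob > F or Prob > chi2); a small value leads to rejecting the hypothesis being tested, just as in Example 24.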




SECTION 10: INTRODUCTION TO PROGRAMMING WITH STATA

This section provides a very quick introduction to the topic of programming with Stata. We touch on three topics:

• creating and using macros
• creating and using loops
• matrix algebra

The purpose here is not to provide a comprehensive description of how to program with Stata, but rather to give you an idea of the kinds of things that can be done with Stata. To fully describe Stata programming would require more space than is available here. Furthermore, I do not (yet) know enough about it to teach it.

macros

A macro assigns a set of words or a number to a name. There are two types of macros.

• "Global" macros stay in memory until you leave Stata
• "Local" macros exist only within the program or do-file in which they are defined

The syntax is relatively simple:

global gmname = "expression"
local lmname = "expression"

To use these macros later, you must use special symbols to tell Stata they are macros:

$gmname
`lmname'
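A short sketch may help to show how the two kinds of macros are defined and used. It assumes the auto data set installed with Stata, and the macro names depvar, xvars, and n are made up for this illustration. The lines should be run together as one do-file, because a local macro disappears when the do-file that defined it ends.

```stata
* Sketch only: macro names and data set are chosen for illustration

* global macro: stays defined until you leave Stata (or drop it)
global depvar "price"

* local macros: a list of words and a number
local xvars "mpg weight"
local n = 2 + 3

sysuse auto, clear

* $ marks a global macro; ` and ' mark a local macro
display "dependent variable: $depvar"
display "regressors: `xvars'"
display "n = `n'"

* Stata substitutes the contents of the macros before running the command,
* so this is the same as typing: regress price mpg weight
regress $depvar `xvars'
```

Note that the opening mark of a local macro is the left single quote (usually at the top left of the keyboard) and the closing mark is the ordinary apostrophe.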

One use of the global macro is to store the name of the folder with the data.

global path = "d:\data\vlss\1998\household"
use "$path\scr09b2.dta"

In addition to saving you some time, this macro is useful if you share the program with others who have different names for the folders on their computers. By using the macro, your colleague can change the global command once rather than trying to change the path in every command that opens or saves a file. Local macros are used (among other places) in loops with the while command, so we will discuss them in the next section.

while

This command starts a loop, allowing groups of Stata commands to be repeated until some condition is met. The syntax is:

    while exp {
        commands
    }

where

exp         is an expression. Stata repeats the commands as long as the expression is true.
commands    are any Stata commands that you want to repeat
brackets    define the beginning and the ending of the commands to be repeated
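Before turning to a regression example, the syntax can be seen in a loop that needs no data at all. The sketch below adds up the numbers 1 to 5 using two local macros (the names i and total are made up for this illustration). It also shows forvalues, a more compact way to write a counting loop that is available in newer versions of Stata; it is offered as an alternative and is not covered elsewhere in this module.

```stata
* Sketch only: a counting loop with while
local i = 1
local total = 0
while `i' <= 5 {
    * add the current value of i to the running total
    local total = `total' + `i'
    display "i = `i', running total = `total'"
    * without this line the condition would always be true
    * and the loop would never stop
    local i = `i' + 1
}
display "final total = `total'"

* the same kind of counting loop written with forvalues
forvalues r = 1/3 {
    display "r = `r'"
}
```

As with the macro example, these lines should be run together as one do-file so that the local macros are still defined when the loop uses them.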

This is an example of a loop that uses local macros to carry out a regression analysis of the determinants of housing value for each region:

    tab reg8, gen(region)
    local r = 1
    while `r' <= 8 {
        regress housval roof floor wall room area water if region`r' == 1
        local r = `r' + 1
    }

The tab command creates a dummy variable for each region (region1, region2, etc). The first local command creates a macro called "r" that is equal to 1. The while statement says that the commands in brackets will be repeated until the condition r<=8 is no longer true. On each loop, the regress command is carried out in one region (when r=3, the if statement is "if region3==1"). The second local command increases the value of r each time that the loop is completed. When r reaches 9, the loop stops because the while condition is no longer true. Then Stata goes on to the next command after the bracket.

matrix

Stata has a special set of commands for matrix algebra. These can be used to implement custom econometric procedures or for doing calculations on the output of regression analysis. This is a very short summary of a long and complex set of commands (type "help matrix" for more information).

1. Creating matrices by hand

Examples:

matrix mymat = (1,2\3,4)                 commas separate elements, backslash indicates new row
matrix myvec = (1,5,3,1,3)               creates a row vector
matrix mycol = (1\5\3\1\3)               creates a column vector

2. Setting the maximum matrix size

For regular Stata, the default maximum matrix size is 40x40, but this can be increased up to 800x800 with the matsize command. For Stata SE, the default maximum is 400x400, but this can be increased up to 11,000x11,000. The maximum matrix size can be changed using

set matsize 500                          sets the maximum size for a matrix at 500x500

3. Manipulating matrices

Examples:

matrix D = B                             makes matrix D equal to matrix B
matrix beta = syminv(X'*X)*X'*y          calculates beta using regression equation
matrix C = (C+C')/2                      redefines C matrix in terms of old values
matrix sub = A[1..., 2..5]/2             defines matrix using sub-set of A matrix
matrix A[2,2] = B                        redefines subset of A matrix as equal to B



4. Converting variables into matrices and vice versa

Variables can be converted into matrices and likewise matrices can be converted into variables. Type "help mkmat" for more information.

5. Using matrices created by Stata

Some Stata commands create matrices which can be retrieved and used. For example, all the regression commands create the following:

e(b)    coefficient vector
e(V)    variance-covariance matrix of the estimates

And these matrices can be used as follows:

matrix beta = e(b)      creates a vector called beta with the estimated coefficients
matrix cov = e(V)       creates a matrix called cov with the estimated covariances

6. Accumulating cross-product matrices

Most statistical computations involve matrix operations such as X'X or X'WX. In many cases, X may have a very large number of rows and a small number of columns. Stata has a special command for calculating cross-products in these cases. Type "help matacum" for more information.

7. Matrix utilities

matrix dir       lists the currently defined matrices
matrix list      displays the contents of a matrix
matrix rename    renames a matrix
matrix drop      deletes a matrix
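The matrix commands described in this section can be combined in a few lines. The sketch below is a minimal illustration: the matrices are typed in by hand, and the regression uses the auto data set installed with Stata rather than the VLSS. The matrix named inner is made up for this example.

```stata
* Sketch only: small matrices typed by hand
matrix mymat = (1,2\3,4)
matrix myvec = (1,5,3,1,3)
matrix mycol = (1\5\3\1\3)

* row vector (1x5) times column vector (5x1) gives a 1x1 matrix
matrix inner = myvec*mycol
matrix list inner

* matrices saved by a regression command
sysuse auto, clear
regress price mpg weight
matrix beta = e(b)
matrix cov = e(V)
matrix list beta
matrix list cov

* the standard error of the first coefficient (mpg) is the square root
* of the first diagonal element of the covariance matrix
display sqrt(cov[1,1])
display _se[mpg]

* matrix utilities
matrix dir
matrix drop mymat
```

The two display commands should print the same number, which shows how e(V) relates to the standard errors in the regression table.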
