Key Data Management Tasks in Stata

FHSS Research Support Center (fhssrsc.byu.edu)

115 and 116 SWKT

Investigate Duplicates in the Data (1a.)

If you suspect that duplicates exist in your data, as in this example:

. use http://fhssweb4:1019/duplicates.dta, clear

. list in 1/15, noobs compress separator(15)

  id   female   ses      read   write   math
   1   female   low        34      44     84
   1   female   low        34      44     40
   2   female   middle     39      41     33
   2   female   middle     39      41     33
   2   female   middle     39      41     33
   3   male     low        63      65     48
   3   male     low        63      65     48
   4   female   low        44      50     41
   4   female   low        44      50     41
   5   male     low        47      40     43
   5   male     low        47      40     43
   6   female   low        47      41     46
   7   male     middle     57      54     59
   8   female   low        39      44     52
   9   male     middle     48      49     52

. //Note different math score for 1st & 2nd obs

You can use duplicates report to investigate.

Duplicates in terms of all variables

. duplicates report

  copies   observations   surplus
       1            197         0
       2              6         3
       3              3         2

Duplicates in terms of id female ses

. duplicates report id female ses

  copies   observations   surplus
       1            195         0
       2              8         4
       3              3         2

The copies column shows observations that occur in 1, 2, or 3 copies. Most observations are unique. In terms of all variables, 3 distinct observations occur twice (6 observations, 3 surplus) and 1 occurs three times (3 observations, 2 surplus).

When the report is given in terms of only some of the variables, there are more duplicated observations.

View the Duplicates in the Data (1b.)

. duplicates list, sepby(id) //new line when id changes

  group:   obs:   id   female   ses      read   write   math
       1      3    2   female   middle     39      41     33
       1      4    2   female   middle     39      41     33
       1      5    2   female   middle     39      41     33

       2      6    3   male     low        63      65     48
       2      7    3   male     low        63      65     48

       3      8    4   female   low        44      50     41
       3      9    4   female   low        44      50     41

       4     10    5   male     low        47      40     43
       4     11    5   male     low        47      40     43

4 observations are completely duplicated in all variables: the first one 3 times and the others twice. Stata creates a different “group:” for each observation that appears duplicated.

. duplicates list id female ses, sepby(id)

  group:   obs:   id   female   ses
       1      1    1   female   low
       1      2    1   female   low

       2      3    2   female   middle
       2      4    2   female   middle
       2      5    2   female   middle

       3      6    3   male     low
       3      7    3   male     low

       4      8    4   female   low
       4      9    4   female   low

       5     10    5   male     low
       5     11    5   male     low

5 observations are duplicated in id, female, and ses, because observations 1 and 2 differ only in math.

Create a Variable to Tag Duplicates (1c.)

. duplicates tag id female ses, gen(dup_id)

The new variable is 0 if the observation is unique, 1 if there is one duplicate of it, 2 if there are two duplicates of it, etc.

. list if dup_id >=1, sepby(id)

        id   female   ses      read   write   math   dup_id
   1.    1   female   low        34      44     84        1
   2.    1   female   low        34      44     40        1

   3.    2   female   middle     39      41     33        2
   4.    2   female   middle     39      41     33        2
   5.    2   female   middle     39      41     33        2

   6.    3   male     low        63      65     48        1
   7.    3   male     low        63      65     48        1

   8.    4   female   low        44      50     41        1
   9.    4   female   low        44      50     41        1

  10.    5   male     low        47      40     43        1
  11.    5   male     low        47      40     43        1

We can see the difference in math scores for observations 1 and 2, which is why duplicates report and duplicates report id female ses gave us different outputs. Let’s set them both equal to 84.

. replace math = 84 if id ==1
(1 real change made)

Drop the Duplicate Observations (1d.)

. duplicates drop
(6 observations deleted)

The command duplicates drop, with no variable list, drops observations that are duplicated on all variables, keeping just the first observation in each group.

Now we run duplicates report to check that all of the duplicate observations have been deleted.

. duplicates report

  copies   observations   surplus
       1            200         0
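
When duplicates are defined by only some of the variables, duplicates drop needs a variable list and the force option, because the dropped observations may differ on the other variables. The following is a minimal sketch on toy data invented for illustration (the values are not from duplicates.dta), meant to be run as a do-file.

```stata
clear
input id str6 female read
1 "female" 34
1 "female" 34
2 "male"   47
2 "male"   52
3 "female" 39
end

* Exact duplicates on all variables: removes the second copy of id 1
duplicates drop
* 4 observations remain
count

* Duplicates on id only: force is required because the two id==2
* observations differ in read; only the first one is kept
duplicates drop id, force
* 3 observations remain
count
```

Because force silently discards whichever later copies exist, it is usually worth inspecting them with duplicates list first.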

Label the Values of a Numeric Variable (2a.)

. use val_labels.dta, clear
. use http://fhssweb4:1019/valuelabels.dta, clear

. list in 1/10, noobs

  make              price   mpg   foreign
  Merc. Zephyr      3,291    20         0
  Chev. Chevette    3,299    29         0
  Chev. Monza       3,667    24         0
  Toyota Corolla    3,748    31         1
  Subaru            3,798    35         1
  AMC Spirit        3,799    22         0
  Merc. Bobcat      3,829    22         0
  Renault Le Car    3,895    26         1
  Chev. Nova        3,955    19         0
  Dodge Colt        3,984    30         0

The variable foreign is currently displayed as a binary numeric variable.

. label define foreign_lbl 0 "domestic car" 1 "foreign car"

Creates a labeling scheme called “foreign_lbl”, but nothing happens to the data yet.

. label values foreign foreign_lbl

Applies the labeling scheme “foreign_lbl” to the variable foreign.

. list in 1/10, noobs

  make              price   mpg   foreign
  Merc. Zephyr      3,291    20   domestic car
  Chev. Chevette    3,299    29   domestic car
  Chev. Monza       3,667    24   domestic car
  Toyota Corolla    3,748    31   foreign car
  Subaru            3,798    35   foreign car
  AMC Spirit        3,799    22   domestic car
  Merc. Bobcat      3,829    22   domestic car
  Renault Le Car    3,895    26   foreign car
  Chev. Nova        3,955    19   domestic car
  Dodge Colt        3,984    30   domestic car

The labels are now displayed for the variable foreign, which is more helpful, but the actual values in the data are still 0 and 1.

Now Let’s Look at the Code In-Depth (2a.)

. label define foreign_lbl 0 "domestic car" 1 "foreign car"

  label define: says we want to define a labeling scheme that will be stored in Stata’s memory and later applied to variables
  foreign_lbl: name of the labeling scheme that we want to create
  0 "domestic car" 1 "foreign car": the actual labeling scheme, i.e. which labels go with which numbers

. label values foreign foreign_lbl

  label values: says we want to apply a labeling scheme to a specific variable
  foreign: name of the variable to which we want to apply the labeling scheme
  foreign_lbl: name of the labeling scheme we want to apply

Create Variable Labels (2b.)

. webuse hbp4, clear

. label variable hbp "high blood pressure"

  hbp: variable we want to label
  "high blood pressure": label we want to give it

. describe

Contains data from http://www.stata-press.com/data/r12/hbp4.dta
  obs:         1,130
 vars:             7                          22 Jan 2011 11:12
 size:        19,210

              storage   display    value
variable name   type    format     label      variable label
id              str10   %10s                  Record identification number
city            byte    %8.0g
year            int     %8.0g
age_grp         byte    %8.0g
race            byte    %8.0g
hbp             byte    %8.0g                 high blood pressure
female          byte    %8.0g      sexlbl

Note the difference between a variable label (a description of the variable, such as "high blood pressure") and a value label (a named mapping from numeric values to text, such as sexlbl).

Create a Labeled Categorical Variable from a Continuous Numeric Variable (3.)


. use http://fhssweb4:1019/recode.dta, clear

. list in 1/8, noobs

  make                mpg   foreign
  Cad. Seville         21   domestic car
  Cad. Eldorado        14   domestic car
  Linc. Mark V         12   domestic car
  Linc. Versailles     14   domestic car
  Peugeot 604          14   foreign car
  Volvo 260            17   foreign car
  Linc. Continental    12   domestic car
  Cad. Deville         14   domestic car

We have a continuous numeric variable (mpg), but instead we want a variable which groups observations into 3 categories, based on mpg.

. recode mpg (min/14=1 "inefficient") (15/30=2 "moderately efficient")
> (30/max=3 "efficient"), gen(efficiency) label(effcny_lbl)
(74 differences between mpg and efficiency)

. label variable efficiency "1-14 mpg=inefficient; 15-30 mpg=efficient;
> greater than 30 mpg=efficient"

The new variable displays its value labels:

  make                mpg   foreign        efficiency
  Cad. Seville         21   domestic car   moderately efficient
  Cad. Eldorado        14   domestic car   inefficient
  Linc. Mark V         12   domestic car   inefficient
  Linc. Versailles     14   domestic car   inefficient
  Peugeot 604          14   foreign car    inefficient
  Volvo 260            17   foreign car    moderately efficient
  Linc. Continental    12   domestic car   inefficient
  Cad. Deville         14   domestic car   inefficient

. list in 1/8, noobs nolabel ab(10)

  make                mpg   foreign   efficiency
  Cad. Seville         21         0            2
  Cad. Eldorado        14         0            1
  Linc. Mark V         12         0            1
  Linc. Versailles     14         0            1
  Peugeot 604          14         1            1
  Volvo 260            17         1            2
  Linc. Continental    12         0            1
  Cad. Deville         14         0            1

Note that the actual values of the new variable are numbers, but it will display value labels. This is what we need for analysis.

Now Let’s Look at the Code In-Depth (3.)

. recode mpg (min/14=1 "inefficient") (15/30=2 "moderately efficient")
> (30/max=3 "efficient"), gen(efficiency) label(effcny_lbl)

. label variable efficiency "1-14 mpg=inefficient; 15-30 mpg=efficient;
> greater than 30 mpg=efficient"

  recode: change the values of a variable based on some coding rules
  mpg: variable whose values I want to change
  (min/14=1: first rule: if the value is between the lowest number and 14, make it a 1…
  "inefficient"): …and give it a value label of “inefficient”
  >: this just means that the command took up more than one line
  gen(efficiency): says that rather than alter the values of mpg, we want to just create a new variable called efficiency
  label(effcny_lbl): the set of value labels that we are defining will be saved as effcny_lbl in Stata’s memory
  label variable: create a variable label (not to be confused with a value label) describing how the coding rules work
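
The value 30 appears in two of the rules above (15/30 and 30/max). A quick way to see how such coding rules behave is to tabulate the result against the original variable. The sketch below uses Stata's shipped auto dataset rather than recode.dta (an assumption made so the example is self-contained), and it assumes recode gives overlapping values to the first matching rule, which the assert at the end checks rather than takes on faith.

```stata
sysuse auto, clear

* Same coding rules as on the slide; /// continues the command in a do-file
recode mpg (min/14=1 "inefficient") (15/30=2 "moderately efficient") ///
    (30/max=3 "efficient"), gen(efficiency) label(effcny_lbl)

* Smallest and largest mpg that landed in each category
tabstat mpg, by(efficiency) statistics(min max n)

* Boundary case: cars with exactly 30 mpg should be in category 2
assert efficiency == 2 if mpg == 30
```

If the assert fails, the rules need non-overlapping ranges, for example (31/max=3 "efficient").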

Convert a String Variable Containing Digits into a Numeric Variable (4a.)

. use http://www.stata-press.com/data/r12/hbp2, clear

Create the numeric variable:

. destring id, generate(numid)
id has all characters numeric; numid generated as double

. list id numid in 1/10

               id       numid
   1.  8008238923   8.008e+09
   2.  8007143470   8.007e+09
   3.  8000468015   8.000e+09
   4.  8006167153   8.006e+09
   5.  8006142590   8.006e+09
   6.  8007340259   8.007e+09
   7.  8004411604   8.004e+09
   8.  8006962950   8.007e+09
   9.  8005012348   8.005e+09
  10.  8003187296   8.003e+09

Notice the default exponential format.

. describe id numid

              storage   display    value
variable name   type    format     label      variable label
id              str10   %10s                  Record identification number
numid           double  %10.0g                Record identification number

Use a fixed format to display the full number:

. format numid %10.0f

. list id numid in 1/10

               id        numid
   1.  8008238923   8008238923
   2.  8007143470   8007143470
   3.  8000468015   8000468015
   4.  8006167153   8006167153
   5.  8006142590   8006142590
   6.  8007340259   8007340259
   7.  8004411604   8004411604
   8.  8006962950   8006962950
   9.  8005012348   8005012348
  10.  8003187296   8003187296
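
The id variable above contains only digits, so destring converts it without complaint. When a string variable also contains other characters, the ignore() and force options control what happens. A minimal sketch on invented values (not from hbp2), intended to be run as a do-file:

```stata
clear
input str12 raw
"8008238923"
"8,007,143"
"n/a"
end

* force: any value containing a non-numeric character becomes missing,
* so both the comma-formatted value and "n/a" are lost here
destring raw, generate(num_forced) force

* ignore(","): commas are stripped before converting; force is still
* needed because "n/a" cannot be converted and becomes missing
destring raw, generate(num_clean) ignore(",") force

format num_forced num_clean %12.0f
list
```

Since force discards information without stopping, it is safer to list the values that turned missing before relying on the new variable.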

Automatically Create a Labeled Numeric Variable from a String Variable (4b.)

. use http://www.stata-press.com/data/r12/hbp2, clear

. encode sex, generate(gender)

Makes a new numeric variable, with value labels containing the text from the original variable.

. describe sex gender

              storage   display    value
variable name   type    format     label      variable label
sex             str6    %9s                                    (original string variable)
gender          long    %8.0g      gender                      (new labeled numeric variable)

. tab sex gender, nolabel

Data values:

             |        gender
         sex |      1       2 |   Total
      female |    433       0 |     433
        male |      0     695 |     695
       Total |    433     695 |   1,128

. tab sex gender

Value labels:

             |        gender
         sex | female    male |   Total
      female |    433       0 |     433
        male |      0     695 |     695
       Total |    433     695 |   1,128

Note: The numeric values are integers beginning with 1, assigned in the alphabetical order of the values of the original string variable.
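
The reverse operation is decode, which builds a string variable from a labeled numeric one. The sketch below, on three invented observations, encodes a string, lists the value label that encode created, and checks that decoding recovers the original text.

```stata
clear
input str6 sex
"male"
"female"
"male"
end

* encode assigns 1, 2, ... in alphabetical order and stores the
* text in a value label named after the new variable
encode sex, generate(gender)
label list gender

* decode goes back from the labeled numeric variable to a string
decode gender, generate(sex2)
assert sex == sex2
```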

Reshape Wide to Long (5a.1)

When you have a wide dataset:

. webuse reshape1, clear

. list

        id   sex   inc80   inc81   inc82   ue80   ue81   ue82
   1.    1     0    5000    5500    6000      0      1      0
   2.    2     1    2000    2200    3300      1      0      0
   3.    3     0    3000    2000    1000      0      0      1

but need a long one, you can reshape the data from wide to long:

. reshape long inc ue, i(id) j(year)
(note: j = 80 81 82)

Data                               wide   ->   long
Number of obs.                        3   ->       9
Number of variables                   8   ->       5
j variable (3 values)                     ->   year
xij variables:
                      inc80 inc81 inc82   ->   inc
                         ue80 ue81 ue82   ->   ue

. list

        id   year   sex    inc   ue
   1.    1     80     0   5000    0
   2.    1     81     0   5500    1
   3.    1     82     0   6000    0
   4.    2     80     1   2000    1
   5.    2     81     1   2200    0
   6.    2     82     1   3300    0
   7.    3     80     0   3000    0
   8.    3     81     0   2000    0
   9.    3     82     0   1000    1

Why would you do this? Some Stata statistical procedures (e.g. xtreg for panel data) require the data to be in long form.

Let’s Look at the Code In-Depth (5a.1)


. reshape long inc ue, i(id) j(year)

  long: we want our data to end up in long form
  inc ue: the two vars that currently have numbers tacked on the end of their names; the ones we want to reshape. In Stata these are called “stubs”.
  i(id): this specifies a unique individual
  j(year): take the numbers off the end of the reshape vars, and put them in a new var called “year”
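
After a reshape, Stata remembers the i(), j(), and stub settings, so the data can be sent back to the other form by typing reshape wide or reshape long with no further arguments. A short sketch using the same webuse example (it needs internet access to fetch reshape1):

```stata
webuse reshape1, clear

* Wide to long, as on the slide
reshape long inc ue, i(id) j(year)
list, sepby(id)

* Back to wide: no arguments needed, the previous settings are reused
reshape wide
list
```

This round trip is a cheap check that i() really identifies each individual, because reshape stops with an error if it does not.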

Reshape Wide to Long Without ID (5a.2)

What if there is no ID variable?

. drop id

. list

        sex   inc80   inc81   inc82   ue80   ue81   ue82
   1.     0    5000    5500    6000      0      1      0
   2.     1    2000    2200    3300      1      0      0
   3.     0    3000    2000    1000      0      0      1

Let’s create one from the observation number:

. generate id=_n

. list

        sex   inc80   inc81   inc82   ue80   ue81   ue82   id
   1.     0    5000    5500    6000      0      1      0    1
   2.     1    2000    2200    3300      1      0      0    2
   3.     0    3000    2000    1000      0      0      1    3

Reshaping to long form with the new id then gives the same result as before:

. list, sepby(id)

        id   year   sex    inc   ue
   1.    1     80     0   5000    0
   2.    1     81     0   5500    1
   3.    1     82     0   6000    0

   4.    2     80     1   2000    1
   5.    2     81     1   2200    0
   6.    2     82     1   3300    0

   7.    3     80     0   3000    0
   8.    3     81     0   2000    0
   9.    3     82     0   1000    1

Reshape Long to Wide (5b.)

When you have a long dataset:

        id   year   sex    inc   ue
   1.    1     80     0   5000    0
   2.    1     81     0   5500    1
   3.    1     82     0   6000    0
   4.    2     80     1   2000    1
   5.    2     81     1   2200    0
   6.    2     82     1   3300    0
   7.    3     80     0   3000    0
   8.    3     81     0   2000    0
   9.    3     82     0   1000    1

but need a wide dataset, you can reshape the data from long to wide:

. reshape wide inc ue, i(id) j(year)
(note: j = 80 81 82)

Data                               long   ->   wide
Number of obs.                        9   ->       3
Number of variables                   5   ->       8
j variable (3 values)              year   ->   (dropped)
xij variables:
                                    inc   ->   inc80 inc81 inc82
                                     ue   ->   ue80 ue81 ue82

and optionally reorder the variables:

. order id sex inc80 inc81 inc82 ue80 ue81 ue82

. list

The order command serves only to rearrange the sequence of the variables in the dataset.

Let’s Look at the Code In-Depth (5b.)

. reshape wide inc ue, i(id) j(year)

  wide: we want our data to end up in wide form
  inc ue: the two vars that change each year, that we want to stick numbers on the end of
  i(id): this specifies a unique individual
  j(year): take the values in the variable “year”, and stick them on the end of inc and ue

. order id sex inc80 inc81 inc82 ue80 ue81 ue82

. list

What We Will Cover After the Break (6.)

• Combining multiple datasets vertically (append and preserve/restore)

• Save subsets of observations to different datasets

• Combining multiple datasets horizontally (1:1 merge)

• Save subsets of variables to different datasets

• m:1 (many-to-one) merging of datasets

• Extract group and individual data from multilevel datasets (collapse)

• Execute commands by groups (bysort)

• Create new variables based on data summaries and functions (egen)

• Create standardized scores and deviation scores (sd and std)

• Automate the same tasks for multiple variables (foreach loops)

• Global and local macros and looping

Append Multiple Datasets and Generate a Labeled Source Identifier (7a.)

Combine several datasets with the same variables but different observations:

capop

        county            pop
   1.   Los Angeles   9878554
   2.   Orange        2997033
   3.   Ventura        798364

ilpop

        county            pop
   1.   Cook          5285107
   2.   DeKalb         103729
   3.   Will           673586

txpop

        county            pop
   1.   Brazos         152415
   2.   Johnson        149797
   3.   Harris        4011475

into a single dataset, while identifying the source of the data:

. use capop, clear

. append using ilpop txpop, generate(state)

. label define statelab 0 "CA" 1 "IL" 2 "TX"

. label values state statelab

. list, sep(0)

        county            pop   state
   1.   Los Angeles   9878554      CA
   2.   Orange        2997033      CA
   3.   Ventura        798364      CA
   4.   Cook          5285107      IL
   5.   DeKalb         103729      IL
   6.   Will           673586      IL
   7.   Brazos         152415      TX
   8.   Johnson        149797      TX
   9.   Harris        4011475      TX

Appending Datasets (7a.)

. use capop, clear

  Open the master dataset.

. append using ilpop txpop, generate(state)

  append using ilpop txpop: append the other datasets to the first one
  generate(state): generate a variable identifying the data source, as consecutive integers beginning with 0

. label define statelab 0 "CA" 1 "IL" 2 "TX"

  Define and name a label for the new source identifier variable.

. label values state statelab

  Apply the label to the source identifier variable.

. list, sep(0)
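
The capop, ilpop, and txpop files are not distributed with Stata, so the sketch below rebuilds small stand-ins as temporary files (two counties each, using values from the slides) and then runs the same append. The tempfile names are local macros, so the whole block has to be run together as one do-file.

```stata
* Build two small "using" datasets as temporary files
clear
input str12 county long pop
"Cook"   5285107
"DeKalb"  103729
end
tempfile il
save `il'

clear
input str12 county long pop
"Brazos"  152415
"Harris" 4011475
end
tempfile tx
save `tx'

* Master dataset stays in memory
clear
input str12 county long pop
"Los Angeles" 9878554
"Orange"      2997033
end

* generate(state) is 0 for the master, 1 and 2 for the appended files
append using `il' `tx', generate(state)
label define statelab 0 "CA" 1 "IL" 2 "TX"
label values state statelab
tabulate state
```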

Save Subsets of Observations to Separate Datasets (7b.)

. list, nolabel sep(0)

        county            pop   state
   1.   Los Angeles   9878554       0
   2.   Orange        2997033       0
   3.   Ventura        798364       0
   4.   Cook          5285107       1
   5.   DeKalb         103729       1
   6.   Will           673586       1
   7.   Brazos         152415       2
   8.   Johnson        149797       2
   9.   Harris        4011475       2

. ***Save subsets of cases

. preserve
. keep if (state==0)
(6 observations deleted)
. save CA, replace
file CA.dta saved
. restore

. preserve
. keep if (state==1)
(6 observations deleted)
. save IL, replace
file IL.dta saved
. restore

. preserve
. keep if (state==2)
(6 observations deleted)
. save TX, replace
file TX.dta saved
. restore

. use CA, clear
. list

        county            pop   state
   1.   Los Angeles   9878554      CA
   2.   Orange        2997033      CA
   3.   Ventura        798364      CA

. use IL, clear
. list

        county            pop   state
   1.   Cook          5285107      IL
   2.   DeKalb         103729      IL
   3.   Will           673586      IL

. use TX, clear
. list

        county            pop   state
   1.   Brazos         152415      TX
   2.   Johnson        149797      TX
   3.   Harris        4011475      TX

Create Separate files Containing Subsets of the Observations (7b.)

. ***Save subsets of cases

. preserve

  Create a temporary backup of the dataset.

. keep if (state==2)

  Keep only a subset of the observations.

. save TX, replace
file TX.dta saved

  Save the subset dataset.

. restore

  Restore the dataset to its original state from the temporary backup.

The same four steps are repeated for each subset (CA, IL, and TX).
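
Because the preserve, keep, save, restore steps are identical for every state, they can be wrapped in a loop. The sketch below is one possible way to do that, not the approach shown on the slides: it builds a three-row stand-in dataset, looks up each file name from the value label with the label extended macro function, and writes CA.dta, IL.dta, and TX.dta to the current working directory.

```stata
* Stand-in for the appended dataset (one county per state)
clear
input str12 county long pop byte state
"Los Angeles" 9878554 0
"Cook"        5285107 1
"Harris"      4011475 2
end
label define statelab 0 "CA" 1 "IL" 2 "TX"
label values state statelab

forvalues s = 0/2 {
    * Text of the value label for code `s', e.g. CA for 0
    local name : label statelab `s'
    preserve
    keep if state == `s'
    save "`name'", replace
    restore
}
```

Note that replace overwrites any existing files with those names.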

Merge Files Containing the Same Observations but Different Variables (8a.)

Merge data from two datasets with the same observations, but different variables (except for the key).

autosize (master)

        make               weight   length
   1.   Toyota Celica       2,410      174
   2.   BMW 320i            2,650      177
   3.   Cad. Seville        4,290      204
   4.   Pont. Grand Prix    3,210      201
   5.   Datsun 210          2,020      165
   6.   Plym. Arrow         3,260      170

autoexpense (using)

        make                price   mpg
   1.   Toyota Celica       5,899    18
   2.   BMW 320i            9,735    25
   3.   Cad. Seville       15,906    21
   4.   Pont. Grand Prix    5,222    19
   5.   Datsun 210          4,589    35

. use autosize, clear
(1978 Automobile Data)

. merge 1:1 make using autoexpense

    Result                           # of obs.
    not matched                             1
        from master                         1  (_merge==1)
        from using                          0  (_merge==2)
    matched                                 5  (_merge==3)

merged (key: make)

        make               weight   length    price   mpg   _merge
   1.   BMW 320i            2,650      177    9,735    25   matched (3)
   2.   Cad. Seville        4,290      204   15,906    21   matched (3)
   3.   Datsun 210          2,020      165    4,589    35   matched (3)
   4.   Plym. Arrow         3,260      170        .     .   master only (1)
   5.   Pont. Grand Prix    3,210      201    5,222    19   matched (3)
   6.   Toyota Celica       2,410      174    5,899    18   matched (3)

1:1 (Match) Merging (8a.)

. use autosize, clear
(1978 Automobile Data)

  Open one of the datasets.

. merge 1:1 make using autoexpense

  merge 1:1: do a match merge
  make: based on a common key variable which uniquely identifies each observation across both datasets
  using autoexpense: merge with the other dataset

    Result                           # of obs.
    not matched                             1
        from master                         1  (_merge==1)
        from using                          0  (_merge==2)
    matched                                 5  (_merge==3)

  matched: observations with data from both datasets
  not matched: observations with data from just one dataset
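
A 1:1 merge fails if the key is not unique in either dataset, so it can help to check the key with isid first and to look at _merge afterwards. The sketch below uses small stand-in datasets built with input (values taken from the slides) instead of the autosize and autoexpense files, and must be run as a single do-file because it relies on a tempfile.

```stata
* Using dataset: price by make
clear
input str16 make price
"BMW 320i"   9735
"Datsun 210" 4589
end
tempfile expense
save `expense'

* Master dataset: weight by make, with one car missing from the using data
clear
input str16 make weight
"BMW 320i"    2650
"Datsun 210"  2020
"Plym. Arrow" 3260
end

* isid stops with an error unless make uniquely identifies observations
isid make

merge 1:1 make using `expense'
tabulate _merge

* Optionally keep only observations found in both datasets
keep if _merge == 3
drop _merge
list
```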

Save Subsets of Variables to Separate Datasets (8b.)

        make               weight   length    price   mpg   _merge
   1.   BMW 320i            2,650      177    9,735    25   matched (3)
   2.   Cad. Seville        4,290      204   15,906    21   matched (3)
   3.   Datsun 210          2,020      165    4,589    35   matched (3)
   4.   Plym. Arrow         3,260      170        .     .   master only (1)
   5.   Pont. Grand Prix    3,210      201    5,222    19   matched (3)
   6.   Toyota Celica       2,410      174    5,899    18   matched (3)

. ***Save subsets of variables

. preserve
. keep make weight length
. save SIZE, replace
file SIZE.dta saved
. restore

. preserve
. keep make price mpg
. save EXPENSE, replace
file EXPENSE.dta saved
. restore

. use SIZE, clear
(1978 Automobile Data)
. list, sep(0)

        make               weight   length
   1.   BMW 320i            2,650      177
   2.   Cad. Seville        4,290      204
   3.   Datsun 210          2,020      165
   4.   Plym. Arrow         3,260      170
   5.   Pont. Grand Prix    3,210      201
   6.   Toyota Celica       2,410      174

. use EXPENSE, clear
(1978 Automobile Data)
. list, sep(0)

        make                price   mpg
   1.   BMW 320i            9,735    25
   2.   Cad. Seville       15,906    21
   3.   Datsun 210          4,589    35
   4.   Plym. Arrow             .     .
   5.   Pont. Grand Prix    5,222    19
   6.   Toyota Celica       5,899    18

Save Subsets of Variables to Separate Datasets (8b.)

. ***Save subsets of variables

. preserve

  Back up the dataset before subsetting variables.

. keep make weight length

  Keep the first variable subset.

. save SIZE, replace
file SIZE.dta saved

  Save the first subset as a Stata data file.

. restore

  Restore the backup dataset.

. preserve
. keep make price mpg
. save EXPENSE, replace
file EXPENSE.dta saved
. restore

Make sure the key variable (make) is included in both subsets.

Distribute Group-level Information Across Individual-level Observations (9a.)

Look up the variable values in “dollars” and attach them to the records in “sforce”.

sforce

        region    name
   1.   N Cntrl   Krantz
   2.   N Cntrl   Phipps
   3.   N Cntrl   Willis
   4.   NE        Ecklund
   5.   NE        Franks
   6.   South     Anderson
   7.   South     Dubnoff
   8.   South     Lee
   9.   South     McNeil
  10.   West      Charles
  11.   West      Cobb
  12.   West      Grant

dollars

        region      sales      cost
   1.   N Cntrl   419,472   227,677
   2.   NE        360,523   138,097
   3.   South     532,399   330,499
   4.   West      310,565   165,348

. use sforce, clear
(Sales Force)

. merge m:1 region using dollars
(label region already defined)

    Result                           # of obs.
    not matched                             0
    matched                                12  (_merge==3)

merged (key: region)

        region    name         sales      cost   _merge
   1.   N Cntrl   Krantz     419,472   227,677   matched (3)
   2.   N Cntrl   Phipps     419,472   227,677   matched (3)
   3.   N Cntrl   Willis     419,472   227,677   matched (3)
   4.   NE        Ecklund    360,523   138,097   matched (3)
   5.   NE        Franks     360,523   138,097   matched (3)
   6.   South     Anderson   532,399   330,499   matched (3)
   7.   South     Dubnoff    532,399   330,499   matched (3)
   8.   South     Lee        532,399   330,499   matched (3)
   9.   South     McNeil     532,399   330,499   matched (3)
  10.   West      Charles    310,565   165,348   matched (3)
  11.   West      Cobb       310,565   165,348   matched (3)
  12.   West      Grant      310,565   165,348   matched (3)

m:1 Many-to-One (Lookup) Merging (9a.)

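. use sforce, clear
(Sales Force)

  sforce: the level 1 dataset (many records per region)

. merge m:1 region using dollars
(label region already defined)

  merge m:1: lookup merging (many level 1 records take their values from one level 2 record)
  region: the key variable
  dollars: the level 2 dataset (one record per region)

    Result                           # of obs.
    not matched                             0
    matched                                12  (_merge==3)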
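
In a lookup merge it is often useful to have Stata stop if any record fails to find its match. The sketch below is an illustration on invented stand-in data (region is a plain string here, whereas the sforce data appear to use a labeled numeric region); it uses the assert(match) option for that check and nogenerate to skip creating _merge.

```stata
* Level 2 (lookup) dataset: one row per region
clear
input str8 region long sales
"NE"   360523
"West" 310565
end
tempfile dollars
save `dollars'

* Level 1 dataset: several salespeople per region
clear
input str8 region str8 name
"NE"   "Ecklund"
"NE"   "Franks"
"West" "Cobb"
end

* assert(match) makes merge exit with an error if any observation
* in either dataset is left unmatched
merge m:1 region using `dollars', assert(match) nogenerate
list, sepby(region)
```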

Extract the Individual- and Group-Level Data from a Multilevel Data Set (9b.)

. ***Get and sort the multilevel data
. use http://www.ats.ucla.edu/stat/hlm/faq/hsball, clear
. sort id

. ***Write out the individual-level data
. preserve
. keep id minority female ses mathach
. codebook, compact

Variable    Obs   Unique       Mean      Min      Max   Label
id         7185      160          .        .        .
minority   7185        2    .274739        0        1
female     7185        2   .5281837        0        1
ses        7185      373   .0001434   -3.758    2.692
mathach    7185     6031   12.74785   -2.832   24.993

Obs = 7185 is the number of students.

. save lev1, replace
file lev1.dta saved
. restore

. ***Write out the school-level data
. collapse (mean) meanses size sector pracad disclim himinty, by(id)
. codebook, compact

Variable   Obs   Unique        Mean      Min     Max   Label
id         160      160           .        .       .
meanses    160      150   -.0001875   -1.188    .831   (mean) meanses
size       160      149    1097.825      100    2713   (mean) size
sector     160        2       .4375        0       1   (mean) sector
pracad     160       73    .5139375        0       1   (mean) pracad
disclim    160      159    -.015125   -2.416   2.756   (mean) disclim
himinty    160        2        .275        0       1   (mean) himinty

Obs = 160 is the number of schools.

Note: Requires that the school-level variables in the original multilevel data have the same (constant) values for every student within a given school.

Separating Level 1 and Level 2 Data (9b.)

. sort id

  Sort by the group identifier.

. preserve
. keep id minority female ses mathach

  Keep the level 1 variables.

. codebook, compact
. save lev1, replace

  Save the level 1 data.

. restore

. ***Write out the school-level data
. collapse (mean) meanses size sector pracad disclim himinty, by(id)

  Get the group means of the level 2 variables.

. codebook, compact

  Finally, save the level 2 dataset.

Aggregating Data by Subgroups [With Frequency Weights] (10.)

Produce a new file with a single observation for each group of records in the original dataset. This example produces the group means and medians.

college (number holds the frequency weights)

        gpa   hour   year   number
   1.   3.2     30      1        3
   2.   3.5     34      1        2
   3.   2.8     28      1        9
   4.   2.1     30      1        4
   5.   3.8     29      2        3
   6.   2.5     30      2        4
   7.   2.9     35      2        5
   8.   3.7     30      3        4
   9.   2.2     35      3        2
  10.   3.3     33      3        3
  11.   3.4     32      4        5
  12.   2.9     31      4        2

. use college, clear

. collapse (mean) gpa hour (median) medgpa=gpa medhour=hour [ fw = number ], by(year)

. list

aggregated

        year        gpa       hour   medgpa   medhour
   1.      1   2.788889   29.44444      2.8        29
   2.      2   2.991667   31.83333      2.9        30
   3.      3   3.233333   32.11111      3.3        33
   4.      4   3.257143   31.71428      3.4        32
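
A frequency weight says that each record stands for that many identical observations. One way to convince yourself of this is to physically replicate the records with expand and collapse without weights. The sketch below does both for the four year-1 records from the college table above; the two listings should agree, with the year-1 means matching the first row of the aggregated output (about 2.788889 and 29.44444).

```stata
clear
input gpa hour year number
3.2 30 1 3
3.5 34 1 2
2.8 28 1 9
2.1 30 1 4
end

* Weighted collapse, as on the slide
preserve
collapse (mean) gpa hour [fw = number], by(year)
list
restore

* Same result by replicating each record number times
expand number
collapse (mean) gpa hour, by(year)
list
```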

Execute Commands by Subgroups (11a.)

• ‘bysort’ runs a command separately for each value of a variable

• Using just ‘by’ requires the data to be sorted by the variable in consideration. ‘bysort’ does that for you

Runs separate regressions for observations when foreign=“domestic” and when foreign=“foreign”

Summarizes the variables price & mpg when foreign=“domestic” and foreign=“foreign”
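
The commands behind the two descriptions above are not reproduced in this transcript. The sketch below shows what they might look like using Stata's shipped auto dataset; the choice of price regressed on mpg is an assumption made for illustration.

```stata
sysuse auto, clear

* Separate regressions for domestic and foreign cars
* (the model, price on mpg, is assumed for this example)
bysort foreign: regress price mpg

* Separate summaries of price and mpg for each group
bysort foreign: summarize price mpg

* Equivalent two-step form: sort first, then use by
sort foreign
by foreign: summarize price mpg
```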

Using bysort to Identify Duplicates (11b.)

4 groups of duplicates

It is important to note that bysort cannot be used with every Stata command; for example, it does not work with scatter or histogram.
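
The code for this slide is not reproduced in the transcript. One common way to identify duplicates with bysort is to use the built-in counters _N (number of observations in the group) and _n (position within the group). The sketch below applies this to a few invented rows patterned after the duplicates example; the variable names copies and copy_no are hypothetical.

```stata
clear
input id str6 female str6 ses math
1 "female" "low"    84
1 "female" "low"    40
2 "female" "middle" 33
2 "female" "middle" 33
2 "female" "middle" 33
3 "male"   "low"    48
end

* Size of each id-female-ses group, and position within the group
bysort id female ses: generate copies  = _N
bysort id female ses: generate copy_no = _n

list, sepby(id)

* Number of groups that contain duplicates (2 in this toy data)
count if copy_no == 1 & copies > 1
```

Observations with copies greater than 1 are the ones that duplicates tag would flag with a nonzero value.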

Within-observation Across-variables Data Summaries (12a.)

Create new variables that are statistical functions of multiple original variables for each observation.

. use http://www.stata-press.com/data/r12/egenxmpl4, clear

Example statistical functions:

. egen rtot = rowtotal(a b c) //row total
. egen rn = rownonmiss(a b c) //row n
. egen rmean = rowmean(a b c) //row mean
. egen rmed = rowmedian(a b c) //row median
. egen rmin = rowmin(a b c) //row minimum
. egen rmax = rowmax(a b c) //row maximum

. list

         a    b    c   rtot   rn   rmean   rmed   rmin   rmax
   1.    .    2    3      5    2     2.5    2.5      2      3
   2.    4    .    6     10    2       5      5      4      6
   3.    7    8    .     15    2     7.5    7.5      7      8
   4.   10   11   12     33    3      11     11     10     12

Within-variable Across-observations Data Summaries (12b.)

Create new variables that are statistical functions of individual original variables across all, or groups of, the observations.

. sysuse autornd.dta, clear //get the data
(1978 Automobile Data)
. keep make mpg //keep make and mpg
. keep in 1/10 //keep the first 10 observations
(64 observations deleted)
. generate mfg=word(make,1) //extract manufacturer from make
. format mfg %-7s //left align the mfg variable

Means for the whole sample:

. egen vm_mpg=mean(mpg) //mpg dataset mean

Means for subgroups:

. bysort mfg: egen gm_mpg=mean(mpg) //mpg group mean

. list, sepby(mfg) //list by mfg

        make            mpg   mfg     vm_mpg     gm_mpg
   1.   AMC Concord      20   AMC         19   18.33333
   2.   AMC Pacer        15   AMC         19   18.33333
   3.   AMC Spirit       20   AMC         19   18.33333

   4.   Buick Century    20   Buick       19   19.28572
   5.   Buick Electra    15   Buick       19   19.28572
   6.   Buick LeSabre    20   Buick       19   19.28572
   7.   Buick Opel       25   Buick       19   19.28572
   8.   Buick Regal      20   Buick       19   19.28572
   9.   Buick Riviera    15   Buick       19   19.28572
  10.   Buick Skylark    20   Buick       19   19.28572

Creating Standardized Scores and Deviation Scores (13.)

. sysuse autornd.dta, clear //get the data
(1978 Automobile Data)
. keep make mpg //keep make and mpg
. keep in 1/10 //keep the first 10 observations
(64 observations deleted)
. egen vm_mpg=mean(mpg) //mpg dataset mean
. egen vs_mpg=sd(mpg) //mpg standard deviation
. egen vz_mpg=std(mpg) //mpg z-scores
. generate vd_mpg=mpg-vm_mpg //mpg deviation scores

. list

        make            mpg   vm_mpg     vs_mpg      vz_mpg   vd_mpg
   1.   AMC Concord      20       19   3.162278    .3162278        1
   2.   AMC Pacer        15       19   3.162278   -1.264911       -4
   3.   AMC Spirit       20       19   3.162278    .3162278        1
   4.   Buick Century    20       19   3.162278    .3162278        1
   5.   Buick Electra    15       19   3.162278   -1.264911       -4
   6.   Buick LeSabre    20       19   3.162278    .3162278        1
   7.   Buick Opel       25       19   3.162278    1.897367        6
   8.   Buick Regal      20       19   3.162278    .3162278        1
   9.   Buick Riviera    15       19   3.162278   -1.264911       -4
  10.   Buick Skylark    20       19   3.162278    .3162278        1

vz_mpg: standardized scores, i.e. (mpg - mean) / standard deviation

vd_mpg: deviations from the variable’s mean, also known as grand mean centering

Create and Format Multiple Variables at Once (14a.)

. sysuse auto.dta, clear
(1978 Automobile Data)
. keep make price mpg headroom
. list in 1/10

        make             price   mpg   headroom
   1.   AMC Concord      4,099    22        2.5
   2.   AMC Pacer        4,749    17        3.0
   3.   AMC Spirit       3,799    22        3.0
   4.   Buick Century    4,816    20        4.5
   5.   Buick Electra    7,827    15        4.0
   6.   Buick LeSabre    5,788    18        4.0
   7.   Buick Opel       4,453    26        3.0
   8.   Buick Regal      5,189    20        2.0
   9.   Buick Riviera   10,372    16        3.5
  10.   Buick Skylark    4,082    19        3.5

. foreach v in price mpg headroom {
  2. egen z_`v'=std(`v')
  3. format z_`v' %6.2f
  4. }

Stata puts these line numbers in the output even though they are not in the do file.

. list in 1/10

        make             price   mpg   headroom   z_price   z_mpg   z_head~m
   1.   AMC Concord      4,099    22        2.5     -0.70    0.12      -0.58
   2.   AMC Pacer        4,749    17        3.0     -0.48   -0.74       0.01
   3.   AMC Spirit       3,799    22        3.0     -0.80    0.12       0.01
   4.   Buick Century    4,816    20        4.5     -0.46   -0.22       1.78
   5.   Buick Electra    7,827    15        4.0      0.56   -1.09       1.19
   6.   Buick LeSabre    5,788    18        4.0     -0.13   -0.57       1.19
   7.   Buick Opel       4,453    26        3.0     -0.58    0.81       0.01
   8.   Buick Regal      5,189    20        2.0     -0.33   -0.22      -1.17
   9.   Buick Riviera   10,372    16        3.5      1.43   -0.92       0.60
  10.   Buick Skylark    4,082    19        3.5     -0.71   -0.40       0.60

Create and Check Dummy Variables (14b.)

. *Dummy variables
. use http://www.stata-press.com/data/r12/nlswork.dta,clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)

. tabulate year, generate(yr)

  interview
       year       Freq.   Percent      Cum.
         68       1,375      4.82      4.82
         69       1,232      4.32      9.14
         70       1,686      5.91     15.05
         71       1,851      6.49     21.53
         72       1,693      5.93     27.47
         73       1,981      6.94     34.41
         75       2,141      7.50     41.91
         77       2,171      7.61     49.52
         78       1,964      6.88     56.40
         80       1,847      6.47     62.88
         82       2,085      7.31     70.18
         83       1,987      6.96     77.15
         85       2,085      7.31     84.45
         87       2,164      7.58     92.04
         88       2,272      7.96    100.00
      Total      28,534    100.00

. *Tabulate to verify
. foreach x of varlist yr1-yr15{
  2. tab `x'
  3. }

     year==
    68.0000       Freq.   Percent      Cum.
          0      27,159     95.18     95.18
          1       1,375      4.82    100.00
      Total      28,534    100.00

--Some output omitted--

     year==
    88.0000       Freq.   Percent      Cum.
          0      26,262     92.04     92.04
          1       2,272      7.96    100.00
      Total      28,534    100.00

Macros (15.)

Macros can be used for many things. Two examples are: 1) lists or other storage, 2) variables.

Global: exists until Stata is closed or the macro is explicitly dropped (for example, with macro drop).

Local: temporary macro; disappears when the do file finishes running.

. // Lists
. global names "Ballav Nick ChongMing Joe David"
. local names2 "Jake Steven Jose Tyrell Martin"

. di "`names2'"
Jake Steven Jose Tyrell Martin

. foreach x in $names {
  2. di "`x'"
  3. }
Ballav
Nick
ChongMing
Joe
David

. foreach x in `names2' {
  2. di "`x'"
  3. }
Jake
Steven
Jose
Tyrell
Martin

. // Macros can also be used to specify variables.
. global ind_vars "female read write math"
. reg ses $ind_vars

      Source |       SS        df       MS          Number of obs =     206
                                                    F(  4,   201) =    6.01
       Model |  11.6241389      4   2.90603473      Prob > F      =  0.0001
    Residual |   97.137997    201   .483273617      R-squared     =  0.1069
                                                    Adj R-squared =  0.0891
       Total |  108.762136    205   .530547004      Root MSE      =  .69518

         ses |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      female |    -.17103   .1049502   -1.63   0.105    -.3779747    .0359146
        read |   .0135099   .0066418    2.03   0.043     .0004133    .0266065
       write |   .0051675   .0073557    0.70   0.483    -.0093367    .0196717
        math |   .0066498   .0067761    0.98   0.328    -.0067116    .0200113
       _cons |   .8044773   .3080703    2.61   0.010      .197013    1.411942
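
The practical difference between the two kinds of macro is scope. The sketch below, meant to be run as one do-file, defines a global and a local and then calls a small program; the program name show_scope and the macro names g_names and l_names are hypothetical. Inside the program the global is visible, while the local defined in the do-file is not, so it expands to nothing.

```stata
capture program drop show_scope
program define show_scope
    * Macros are expanded when the program runs, not when it is defined
    display "global inside program: $g_names"
    display "local inside program:  [`l_names']"
end

global g_names "Ballav Nick"
local  l_names "Jake Steven"

* In the do-file that defined it, the local is available
display "local in this do-file: [`l_names']"

* Inside the program only the global is available; the brackets print empty
show_scope

* Globals persist until dropped, so clean up explicitly
macro drop g_names
```

Because globals are visible everywhere, locals are usually the safer default inside do-files and programs.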
