Stata Data Management - imperial.ac.uk Data... · Stata Data Management Hilary Watt 1 SIDM refers...


Stata Data Management

Hilary Watt 1 SIDM refers to Stata Intro and Data Management series of workshops

Stata Data Management

This guide takes you from beginner’s level through to advanced tips and hints.

A complementary series of workshops, the Stata Introduction and Data Management course (referred to as SIDM in this guide), covers much of the material taught here. Working through the workshop exercises will help you to remember the main features of Stata data management, and will give you practice in writing commands in different situations.

Absolute beginners, navigating Stata

Reading data into Stata and reformatting

Saving data sets and results

Importance of saving commands in do files

Creating and labelling new variables

Importance of checking data for errors

Merging data sets and restructuring

Advanced tips & hints

Further resources for learning Stata

Written for Stata 13 but generally good for most versions of Stata.

I personally felt I knew Stata once I had a good feel for data manipulation. This knowledge, along with some basic descriptive statistics commands and data analysis commands, is what you need to be able to work with many different datasets. Without knowledge of data management commands, I’ve seen people do things in ways that are far more time consuming, so looking through these hints may save you a lot of time.


Contents

1 Finding your way around Stata
1.1 Stata resources to help you learn, elementary and advanced.
1.2 Stata Layout for beginners
1.3 Opening Stata windows, including any that you may have lost
1.4 The Stata Toolbar
1.5 Stata command syntax
2 Reading Data into Stata
2.1 Changing the working directory in Stata
2.2 Opening a Stata dataset
2.3 Opening data from an excel file
2.4 Opening data from a CSV file
2.5 Hints on preparing data to be read in
3 Viewing your data in Stata, data type and data display formats.
3.1 Variables window
3.2 Data editor/ browser window
3.3 Browse command
3.4 Describe command
3.5 Further information on data storage types and formats
3.6 Stata Missing Data Codes
3.7 A few basic commands
3.8 To permanently reorder data:
3.9 To find a lost variable or to explore what variables are available:
4 Saving your work in Stata/ from Stata
4.1 Essential to Create a do file of data commands
4.2 Saving Stata Output
4.3 Creating tables from stata output
4.4 Saving data in Stata format
4.5 Saving data in Excel format
4.6 Saving data in csv format
4.7 Saving graphs and combining graphs
5 Creating and recoding variables; Putting Data into a Useful format for Analysis
5.1 Creating new variables with maths operations and other functions


5.2 IF statements to restrict commands to only some rows/ observations:
5.3 IN statement to restrict command to only some numbered rows/ observations:
5.4 Labelling variable names, creating value labels, labelling data sets
5.5 Converting string variables to numeric data (destring) to enable analysis.
5.6 Converting string into labelled categorical data (encode) & vice versa (decode)
5.7 Tidying up strings before encoding; further string functions
5.8 Recoding numeric & categorical variables
5.9 Categorical variables from numerical data (e.g. quintiles, specified cut-offs)
5.10 Creating a new variable counting the observation number, or repeat number
6 Checking data for errors: the first essential part of any data analysis
6.1 Looking out for missing data in stata
6.2 Consider if other values need to be recoded to missing
6.3 Look out for inconsistencies in units
6.4 Look out for undefined category values
6.5 Missing Dates: do you have many dates the same and you don’t know why?
6.6 Check for consistencies between variables
6.7 Check inconsistencies and outlying values to the original data source
6.8 To check for duplicate people/ observations in your data set:
6.9 Check every new variable that you create
7 Dates and times
7.1 Create a new date variable recognised by stata in calculations as date
7.2 Create a date and time variable that stata recognises as such from a string variable
7.3 Convert from clock format to standard date format
7.4 Comparing a date variable with a definitive data that we specify:
7.5 Create a stata date variable from 3 separate variables for month, year and day
7.6 Finding the time elapsed between 2 stata date variables
7.7 Extracting month or year from stata date variable and generating new variables
8 Merging data sets and changing shape and size of data set
8.1 Combining datasets
8.2 Reshaping datasets
8.3 Creating variables of summary statistics
8.4 Managing files: comparing datasets and comparing variables
9 Other Useful Information
9.1 To use stata as a calculator: Use display or di command:


9.2 Restricting commands to run on only a subset of the data
9.3 Tabulating your data
9.4 To get result from your analysis in easy to use format for producing tables
9.5 Summary statistics giving incidence rates and SMR and corresponding results for case-control studies
9.6 Generating random numbers
9.7 To get p-values from probability distributions & vice versa:
9.8 Accessing saved results of commands
9.9 Loops in Stata
9.10 Explicit subscripting: check for duplicates, adding or subtracting data from different rows of your dataset
9.11 Using Stata commands from the internet that are not pre-installed on Stata
10 Further resources for learning Stata
10.1 Excel spread sheet of commands
10.2 Interactive workshops on the material in this Stata data management guide
10.3 Interactive workshops to teach Statistical Analysis
10.4 Stata teaching videos
10.5 Stata help
10.6 Stata Manuals
10.7 Stata search
10.8 UCLA website
10.9 Stata Programs for Teaching Statistics
10.10 Resource if you want to create your own commands
10.11 Books for learning Stata
10.12 Internet searches

KEY: stata commands given in yellow highlight


1 Finding your way around Stata

1.1 Stata resources to help you learn, elementary and advanced.

Tour of the Stata 13 or Stata 14 user interface, for those who are brand new to Stata and don’t know their way around the interface:

https://www.youtube.com/watch?v=2Lde75owQlU

https://www.youtube.com/watch?v=3SsBePUY_eI

Tour of Stata help, to help you use this resource more fully:

https://www.youtube.com/watch?v=UpXNMeTzmuI

help commandname

help ttest gives help for the ttest

help histogram gives help for histogram (and similarly for other commands)

Sometimes the best way to use help is to go straight to the examples near the bottom of the help file; this may be all that you need. Sometimes I look through the list of options, but I don’t pay too much attention to those options that I don’t understand. At the bottom there is occasionally a link to the relevant teaching video (otherwise you can search YouTube if you like).

Tour of Stata pdf documentation: https://www.youtube.com/watch?v=0bdQkUBQO2U

The Stata manuals are an excellent resource for teaching statistics, as well as for teaching Stata. They are written in a reasonably user-friendly way, to say what the different commands are doing. They also give you technical references for the different methods, which you might want to quote, and they generally give the equations that are being used.

For a list of all Stata videos from the official Stata YouTube channel (which you can also access via the “help videos” command): http://www.stata.com/links/video-tutorials/

UCLA also have many excellent teaching resources available on their web site.

http://www.ats.ucla.edu/stat/stata/

For data management more specifically:

help data management gives a list of many commands that may be useful to you. Feel free to browse through these.

help operator gives a list of operators (+, -, /, *, ^, <, !=, and, or, etc.).

help functions is a help command that I use very frequently, to help me to create new variables.

See also section on further resources at the end of this guide.

1.2 Stata Layout for beginners

This shows the default layout of Stata, with the review, results, variables, properties and command windows.


It is possible to change the layout of your windows, but you don’t need to do this (though resizing

windows can be helpful, by holding the mouse over a dividing line between windows, holding the

left mouse button down and dragging the dividing line to the place required).

1.3 Opening Stata windows, including any that you may have lost

For example, if you lose the variables window, click on the Window menu and then on the relevant window to reopen it.

If you want to go back to the default arrangement of your windows (i.e. see all the windows in the layout above), then use the following menu option:

Edit – Preference – load preference set – widescreen layout (default).

There are more windows available that can be opened, by clicking on relevant Stata icons:

data editor – to view your data

do-file editor – to write your Stata commands so that they are saved, and to run them from here, including any previously saved commands

log files – to save the output from your Stata commands

graphs window – which will open automatically when you produce a Stata graph

1.4 The Stata Toolbar

The references below refer to the Stata manuals. You can type the command “help gsw” to get to the getting started manual (Windows version), then click on [GSW].

You can also google manual headings and find the relevant part of the manuals online.


http://www.stata.com/manuals13/gsw.pdf – this gives the pdf referred to as [GSW] above.

Type “help [GSW]” within Stata and then click on a red [GSW] link to get the manual within Stata.

The [GSW] manual referred to here, “Getting Started with Stata”, is an alternative resource for learning Stata. It is more in depth than this guide in some places, but does not cover everything in this guide. Looking at its contents may be helpful. The guide that you are now reading picks out specific things from this basic manual and from more advanced manuals that are likely to be especially helpful to you in getting data into good order for analysis.


1.5 Stata command syntax

Stata is case sensitive – commands are in lower case. Variable names can be in upper case, lower case or a mixture of both; just remember to be consistent.

Many Stata commands can be abbreviated. Stata variable names can also be abbreviated, so long as the abbreviation still uniquely identifies a variable.
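As a minimal sketch of these syntax rules, the following uses the auto dataset that ships with Stata (not one of the datasets used in this guide). The variable Price is created purely for illustration.

```stata
* Sketch using the example dataset installed with Stata
sysuse auto, clear

summarize price mpg      // full command name
su price mpg             // the same command, abbreviated
su pri                   // "pri" uniquely identifies the variable price

* Case matters: Price is a different variable name from price
generate Price = price / 1000
describe price Price
```

Abbreviations save typing at the command line, but full names are usually easier to read in a saved do file.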

2 Reading Data into Stata

2.1 Changing the working directory in Stata

If you want to check what directory Stata is using by default (for opening datasets and saving them into), use the cd command: type “cd” into the command line, and Stata writes this command and your current directory into the results window.

The cd command can also be used to specify the default directory where Stata will look for files to read in, and where it will save files (unless the full file name with directories is specified when files are read in or saved).

After changing to the H:\_MPHTeaching directory, for example, Stata looks in H:\_MPHTeaching for files to read in, and saves files there (see SIDM 1 of 4).
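The two uses of cd described in this section can be sketched as follows. The path is the one named in this guide; it is assumed to exist, so substitute a folder on your own machine.

```stata
* Show the current working directory
cd

* Change the working directory (assumes this folder exists)
cd "H:\_MPHTeaching"

* pwd also displays the current working directory
pwd
```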

2.2 Opening a Stata dataset

The clear command removes any data currently in Stata’s working memory (clear can also be used as an option on a command that opens a file).

The simplest way to open a Stata dataset is to use the open file icon, then select the relevant directory and click on the relevant file. Double-clicking on a Stata file will also open it. The corresponding command then appears in the Results window. The variables in this data set are now listed in the Variables window.

Alternatively, rather than using the open file icon, type in the command exactly as you see it (into the command window or do-file editor). The same file can also be read in from the same directory with two commands: first the cd command changes the directory, then the use command reads in the file.


If you already have data read into Stata, you might get an error message, which means that you need to save your current data first (if you wish to keep it, with any amendments you may have made after reading it in).

Using the use command with the “clear” option will read the babies dataset into memory, and clear away any dataset which was previously held within Stata, without saving it and without warning you of any possible loss of data (see SIDM 1 of 4).
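A minimal sketch of the commands described in this section, assuming that a file called babies.dta sits in the H:\_MPHTeaching directory:

```stata
* Change directory, then read in the file (assumes babies.dta is in this folder)
cd "H:\_MPHTeaching"
use babies

* If data in memory have been changed and not saved, use stops with the error
* "no; data in memory would be lost". Either save first, or discard the data in memory:
use babies, clear
```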

2.3 Opening data from an excel file

The above imports the file babies.xls into Stata (see SIDM 2 of 4). More generally:
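As a hedged sketch of the import excel command, the first line below assumes that babies.xls is in the working directory and that its first row holds the variable names. The file and sheet names in the second command are hypothetical placeholders.

```stata
* Read babies.xls, taking variable names from the first row
import excel using "babies.xls", firstrow clear

* More generally (myfile.xlsx and Sheet1 are placeholder names)
import excel using "myfile.xlsx", sheet("Sheet1") firstrow clear
```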

2.4 Opening data from a CSV file

The above imports the file babies.csv into Stata. More generally:
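As a hedged sketch, the import delimited command (available from Stata 13) reads CSV files. The first line assumes that babies.csv is in the working directory; the file name in the second command is a hypothetical placeholder.

```stata
* Read babies.csv into memory, replacing any data already there
import delimited using "babies.csv", clear

* More generally (myfile.csv is a placeholder); varnames(1) takes variable names from row 1
import delimited using "myfile.csv", varnames(1) clear
```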

2.5 Hints on preparing data to be read in

You need a tidy Excel file or other file. For Excel, the first row should contain suitable variable names, and all the following rows should contain data. Avoid empty rows and columns.

If you want a variable to be read in as numbers, make sure that there is no text in any of the rows of data for that variable (although some letter codes might be important information that you do not want to lose, in which case you could write them into a separate variable). If a variable is not read in as a number directly, you can change it into number format later on.

Some things should be read in as text. This includes long id numbers, because otherwise Stata might round them off, which makes them useless as id variables. Add an extra line of data at the start with a letter in the relevant column, to make sure that Stata takes this column to be text data. Similarly, dates (and times) should be read in as text; you will then need the date and time commands below to convert them into a proper Stata date format.

It is worth checking the number of rows of data, the number of columns, and the means (and perhaps even the SDs) of some variables against your source data, to double check that your data has been read in correctly.

http://www.stata.com/manuals13/u.pdf – p327 of the Stata [U] manual gives more detail on entering data into Stata, including how to determine which method to use.
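The checks on rows, columns, means and SDs suggested above could be run with something like the following sketch, which assumes that the imported data are already in memory:

```stata
describe, short     // number of observations (rows) and variables (columns)
count               // number of observations
summarize           // means and SDs, to compare against the source file
```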

3 Viewing your data in Stata, data type and data display formats.

3.1 Variables window

This shows your variable names (used in Stata commands) and variable labels (fuller descriptions of what the variables are). The properties window shows more information on the variables.

3.2 Data editor/ browser window

This window is not generally open until you click on the relevant icon. Two icons open the data editor, and you should routinely use the right-hand one. Use the left-hand one only when you want to edit the data by typing new values directly into your database (not recommended, as this is not a robust way to amend your data). The right-hand one opens the data in browse mode, which means that you cannot accidentally change your data by leaning on the keyboard or similar.

Variables in black are numeric, those in red are string variables, and blue indicates categorical numeric data with value labels. Some look like dates/ times and are black (so numeric), which indicates that they are stored in Stata date/time format. When they look like numbers but are red (and therefore stored as strings), Stata will not recognise them as numbers; see data formats below. When they look like numbers but are blue, this indicates that at least one value has a value label attached to it. Stata will analyse all values, including any with value labels, as numeric data; the numerical value taken for categorical variables is that of the underlying value, ignoring anything written in the value label (see SIDM 1 of 4).

3.3 Browse command

If you don’t specify variables, this is the same as opening the browse window. If you do specify variables, you will see only the variables specified, which can be helpful, e.g. browse babyid smoke to view only the babyid and smoke variables. You can also combine browse with if statements (see SIDM 1 of 4).
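A short sketch of browse with and without an if statement, assuming that the babies data are in memory. The coding smoke == 1 is a hypothetical value chosen only for illustration.

```stata
browse babyid smoke                  // only these two variables
browse babyid smoke if smoke == 1    // only rows where smoke equals 1 (hypothetical coding)
browse if missing(smoke)             // all variables, only rows where smoke is missing
```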

3.4 Describe command

This is a great command to use when you read in a new dataset, to see what you have. Example:


The describe output shows the variable names and variable labels. It also specifies the storage type and display format. For storage type, look for string variables (types starting with str); all other types (including byte, int, float and double) are numeric. Look for value labels, which indicate categorical variables. In the format column, look particularly for %td, which indicates dates, and %tc, which indicates date/ time. (Note: the colours used in the data browser are not used here.) (See SIDM 1 of 4.)
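As a minimal sketch, describe can be tried on the auto dataset that ships with Stata (not one of the datasets used in this guide):

```stata
sysuse auto, clear
describe                // all variables: storage type, display format, value label, variable label
describe make foreign   // make is stored as a string; foreign carries a value label
```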

3.5 Further information on data storage types and formats

String variables are stored as text, which can include letters, numbers, spaces and special characters. Sometimes all the values look like numbers, even though the storage type is str#.

Variables usually need to be numeric to be used in most analyses, although identifier variables, such as patient id, can be useful even when stored as strings. There may be a need to convert some variables from string to numeric data before analysis (see later in this guide). This includes variables containing date/ time information, which need to be stored numerically but can then be formatted as dates, so that you can still read them in a meaningful manner.

Categorical numeric data with value labels have a numerical storage type and a label name listed under “value label” in the describe output. Value labels appear in tables and in output from regression analyses, so meaningful labels help you to remember what information is being stored and to interpret Stata output more easily. Similarly, variable labels often appear in Stata output rather than variable names, so meaningful variable labels make sure that you know/ remember what the variables contain, and make Stata output easier to interpret.

For details of data types, see help data types.

For details of formats, type help format. Formats affect how data is displayed, but not the underlying values. Value labels also affect the appearance (in that value labels usually then appear, rather than numbers).
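The point that formats change the display but not the stored values can be sketched with the auto dataset that ships with Stata. The variable d is created purely for illustration.

```stata
sysuse auto, clear
list price in 1/3
format price %9.2f        // display with two decimal places
list price in 1/3         // the display has changed...
summarize price           // ...but the stored values have not

* A number displayed as a date: 0 is 01jan1960 in Stata's daily date format
generate d = 0
format d %td
list d in 1
```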

3.6 Stata Missing Data Codes

The missing data code in Stata is . (dot) for numeric data and is blank for string variables. Since string contents are written in quotes, missing strings are written "" in Stata do files. For example, gen var1=. creates var1 as a missing numeric variable, and gen var2="" creates var2 as a missing string variable (see SIDM 1 of 4).
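A minimal sketch of the two missing codes, using the auto dataset that ships with Stata. The names var1 and var2 follow the example above; the use of rep78 is an illustration, not taken from this guide.

```stata
sysuse auto, clear
generate var1 = .           // numeric variable, missing in every row
generate var2 = ""          // string variable, missing (empty) in every row
count if missing(var1)      // missing() works for numeric and string variables
count if missing(var2)

* Numeric missing counts as larger than any number, so exclude it explicitly
count if rep78 > 3 & !missing(rep78)
```

Because numeric missing values are treated as larger than any number, a condition such as rep78 > 3 on its own would also count the rows where rep78 is missing.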

3.7 A few basic commands

codebook is really helpful for telling you about missing data, percentiles, ranges of data and similar, and about how any categorical variables are coded (see SIDM 1 of 4).

To list variables so that they appear in the results window, type list followed by the variable names:

list patientkey registerdate createddate

Output not shown; it has one line for each observation.

To count the number of observations in the data set, or the number satisfying particular criteria:

count

1883 = Stata output, the number of observations

keep will keep the specified variables, and drop will drop the listed variables.
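A hedged sketch of count, list, keep and drop, assuming that the dataset shown in this section is in memory. The cut-off 1930 and the choice of variables are only illustrative.

```stata
count if yearofbirth < 1930              // number of observations meeting a criterion
list patientkey yearofbirth in 1/10      // first ten rows only

preserve                                 // keep a copy so that the next step can be undone
keep patientkey yearofbirth monthofbirth
restore                                  // back to the full dataset
drop cpmriskscore                        // remove one variable
```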

To summarise all variables in the data set (any string variables appear as missing, with zero observations):

summ

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
cpmriskscore |         0
  patientkey |      1883     1003684    381085.2     652139    2156012
gppractice~e |         0
 yearofbirth |      1883    1926.737    5.484465       1910       1936
monthofbirth |      1883     6.62188    3.402567          1         12

Use codebook for more details, or use summ, detail (which includes the median and other centiles, and the largest and smallest values).

To tabulate data:

Use the tab command, or tab1 (for 1 way tables – can list several variables at a time) or tab2 (for 2 way tables – can list several variables at a time). Add “, missing” to check what is happening with missing values (otherwise it is easy to overlook them).

The tabstat command is somewhat more flexible – see Stata help. See also statsby.
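The tabulation commands above can be sketched with the auto dataset that ships with Stata (not one of the datasets used in this guide):

```stata
sysuse auto, clear
tab rep78, missing               // one-way table, with a row for missing values
tab1 rep78 foreign, missing      // a one-way table for each variable listed
tab rep78 foreign, missing       // two-way table
tab2 rep78 foreign headroom      // two-way tables for every pair of variables listed
tabstat price mpg, by(foreign) statistics(n mean sd median)
```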


3.8 To permanently reorder data:

order patientkey age sex weight – you don’t need to list all variables, the ones you

do list will appear first (with order otherwise unchanged).

3.9 To find a lost variable or to explore what variables are available:

lookfor date – looks for all variable names and variable labels containing the letters “date”.

Useful with very large datasets, when you want to know what variables might be relevant, e.g. find

any containing chol (for cholesterol) or any containing date.

(and rename – to rename a variable)
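A short sketch of order, lookfor and rename together, again assuming the auto example dataset shipped with Stata; the new variable name repair1978 is hypothetical.

```stata
sysuse auto, clear
order make foreign price      // these three now come first; the rest keep their order
lookfor price                 // lists variables whose name or label contains "price"
rename rep78 repair1978       // hypothetical new name, for illustration only
describe, short               // confirm the number of variables is unchanged
```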

4 Saving your work in Stata/ from Stata

The most important aspect of your work is your Stata commands, so save them safely every time

you use Stata into a do file.

4.1 Essential to Create a do file of data commands

It is really important to have a well organised file of data commands, getting from original data set

to the version that you use in your analysis. This becomes a record of any recoding decisions you

made. Then if somebody gives you a new version of the data set, or a similar data set, then you can

rerun some/ all the data commands, and very quickly get the new data set into the format that you

want and with any newly created variables that you need. Some people save numerous copies of

their data set, whenever they add a variable or make a minor modification to it. This is unnecessary.

I would save a copy of the data set as originally given to me, or as first input into stata. Then data

sets derived from that can be overwritten as new variables are added, confident that you know how

to re-derive them by rerunning your data do file. If you aren’t confident of how you created your

original data set, or as an additional safety check, you can see if your do file does recreate it by

saving it under a different name and comparing databases.

For greater comprehensibility, add notes to your do file by using * or // at the start of a line, to tell Stata

that it is not code ( // can also be used after the command to add comments). You can also comment

out several lines at a time, or a part of a line, or the end of a line, by using /* text contained in here

and line breaks contained here are ignored by Stata */ (See SIDM 1 of 4)
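As an illustration only, a data do file along these lines might look like the sketch below. The file names (prepare_data.do, auto_derived) and the derived variable are hypothetical, and the auto example dataset stands in for your own original data.

```stata
* prepare_data.do : hypothetical name for the do file of data commands
version 13                      // run under Stata 13 rules, the version this guide assumes
clear all
sysuse auto, clear              // stands in for "use" of your untouched original data set

* every recoding decision is recorded here
gen price_k = price/1000        // price in thousands of dollars
label var price_k "Price (thousands of dollars)"

/* several lines can be commented out at once:
replace price_k = . if price_k > 15
*/

save auto_derived, replace      // derived data set, safe to overwrite on each rerun
```

Rerunning the whole file recreates the derived data set from the original, which is why intermediate copies are unnecessary.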

4.2 Saving Stata Output

If you want to record your output as well, either to print it, or to save on your computer, then you

can create and save log files (log using command, or relevant stata icon) (see SIDM 1 of 4).

To copy into word, including Stata graphs if desired: Copy as a picture (select and right click mouse

and select copy as picture) to save into a word file for your own use. This will not generally give you

results in a format that you can directly include into an article, but is useful for your own records.

You can include Stata graphs into your word file (by clicking on edit – copy graph in the Stata

graphics window – then pasting into word).
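A minimal sketch of opening and closing a log file from a do file. The log name mylog is hypothetical, and the auto example dataset is used only so that there is some output to record.

```stata
capture log close                 // close any log left open by an earlier run
log using mylog, text replace     // plain-text log, written to mylog.log in the working directory
sysuse auto, clear
summ price mpg                    // this output is recorded in the log
log close
```

The text option gives a file that opens in any editor; without it Stata writes its own .smcl format.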

4.3 Creating tables from stata output

To copy data to be used in a table or perhaps in calculations – copy as a table (select with mouse and

right click then select copy as table) and paste into excel – you can then amend it as required. You

need to be careful about selecting whole rows for this to work.

[Within Excel, I love the concatenate operator “&”, which combines text and numbers into a single cell, so

=D3&"("&ROUND(100*D3/E3,1)&"%)" might be useful, written perhaps into cell G3. This quotes the

number in cell D3, and then gives it as a percentage of the number in cell E3, rounded to 1

decimal place and surrounded by brackets. Things contained in " " are written directly, and other

things are numbers taken from other cells or functions of them. The formula looks ugly, but can be

copied down a column, saving you from needing to add brackets and calculate percentages for

yourself.] (See SIDM 4 of 4).

4.4 Saving data in Stata format

This can be done using the relevant icon or else with the following command:
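A minimal sketch of the save command, assuming a hypothetical file name mydata and the auto example dataset in place of your own data:

```stata
sysuse auto, clear            // example data; in practice this is your own data in memory
save mydata, replace          // writes mydata.dta to the working directory, overwriting any old copy
use mydata, clear             // reads it back in
```

Without the replace option, Stata refuses to overwrite an existing file of the same name.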

Remember it is very helpful to specify the working directory, where files are saved by default, which

avoids the need for files names to include directories – see 2.1. (See SIDM 1 of 4).

4.5 Saving data in Excel format
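One way to do this is the export excel command. The sketch below assumes a hypothetical file name mydata.xlsx and uses the auto example dataset; check help export excel for the options available in your version.

```stata
sysuse auto, clear
export excel using "mydata.xlsx", firstrow(variables) replace   // variable names go in row 1
import excel using "mydata.xlsx", firstrow clear                // read it back in as a check
```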

4.6 Saving data in csv format
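In Stata 13 this can be done with export delimited (older versions use outsheet with the comma option instead). A minimal sketch, assuming a hypothetical file name mydata.csv and the auto example dataset:

```stata
sysuse auto, clear
export delimited using "mydata.csv", replace   // comma-separated, variable names in the first row
import delimited using "mydata.csv", clear     // read it back in as a check
```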

4.7 Saving graphs and combining graphs

Any graphs created by Stata appear in the graphs window. As soon as another graph is created, this

is overwritten. To save the graph, either click on edit - copy graph and then paste into word or else

save within Stata. It is important to use cd (change directory) command to specify directory before

saving graphs ( see 2.1 of this guide).

hist var1, saving(g1) // saves histogram to a file called g1.gph

hist var1, saving(g1.gph) // saves histogram to a file called g1.gph – always use .gph extension for

graph (although not necessary to specify in such commands)

hist var1, saving(g1, replace) // as above, but will overwrite any graph of this name already present

in the directory, without warning.

graph combine g1.gph g2.gph /* combines g1.gph and g2.gph into a single graph, can also be used

to combine more graphs together. In my experience it is usually necessary to specify .gph here. It is

also necessary to use cd command to change to appropriate directory first */

If you want to save graphs in a different format (e.g. .tif might be useful for publications), then in

the graphs window, go to file - save as, and select the appropriate file type under “save as type”.
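The same can be done by command with graph export, which picks the format from the file extension. A sketch assuming the auto example dataset and hypothetical file names:

```stata
sysuse auto, clear
hist price, saving(g1, replace)        // saved as g1.gph
hist mpg, saving(g2, replace)          // saved as g2.gph
graph combine g1.gph g2.gph            // both histograms in one graph
graph export combined.png, replace     // other extensions (e.g. .tif, .eps) select other formats
```

Which formats are available depends on your operating system; help graph export lists them.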

5 Creating and recoding variables; Putting Data into a Useful format for Analysis

5.1 Creating new variables with maths operations and other functions

“help generate” gives further details.

See [GSW] manual p85, within stata use help GSW or http://www.stata.com/manuals13/gsw.pdf

gen newvar1=var1 + var2 // sums var1 and var2 and puts total into newly created var called newvar1

gen newvar=3*vara+10*varb // multiplies vara by 3, and varb by 10, then sums them, putting answer

into a newly created variable called newvar

gen newvar2= cholesterol^2 // this raises cholesterol to the power of 2, and puts result into newly

created variable called newvar2

Can use +, -, / (for divide), * (for multiply), ^ (for to the power) “help operator” gives more details

including order of precedence – use with brackets if necessary

gen logvar1=log(var1) // creates var logvar1 by taking log to base e of var1 [=ln(var1) is equivalent]

gen logvar2=log10(var2) // creates var called logvar2 by taking log to base 10 of var2

gen var2sqrt=sqrt(var2) // creates var2sqrt as square root of var2

log, log10 and sqrt are all examples of functions; help function gives very many more options here.

These are examples of mathematical functions, which might be the most useful in the first place.

I use help function more than any other help command!!! help function !!!

Once you’ve created a new variable, use replace to change values e.g.

replace logvar1=log(0.5) if var1==0 // Overwrites values of logvar1 for some observations; replace

statements usually have “if” statements at the end of them (see next section).

A further command to generate variables:

egen newvar2=rowtotal(vara varb) – you might not know the egen command but it does lots of

things, e.g. adding up variables or averaging over non-missing values. Use help egen for

a list of options available (though more specialised than generate and replace, and not necessary for

beginners, it saves lots of time in some situations) (see SIDM 3 of 4).

Checking what you’ve created, paying particular attention to missing data:

browse newvar vara varb // look at what you’ve created & at vars you used in the process

summ newvar vara varb // check how much missing data, and ranges of values

tab newvar, miss // tabulate all values, including missing data
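The difference between gen and egen matters most when data are missing. The sketch below uses a tiny made-up data set (the values and the new variable names are purely illustrative) to show how each handles missing values.

```stata
clear
input vara varb
1 2
3 .
. .
end
gen  total_gen  = vara + varb            // missing if either part is missing
egen total_egen = rowtotal(vara varb)    // treats missing as 0, so an all-missing row gives 0
egen mean_egen  = rowmean(vara varb)     // mean of the non-missing values only
egen n_present  = rownonmiss(vara varb)  // how many of the values are present
list
```

Note that rowtotal returns 0 rather than missing when every value is missing, so n_present is worth keeping alongside it.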

5.2 IF statements to restrict commands to only some rows/ observations:

IF statements are often used in creating variables, and can also be added to nearly all commands.

They are placed at the end of a command, before any comma, e.g. "tab var1 if var2==1" or "summ

var1 if var2==1, detail".

== (2 equals signs) means “evaluate to see whether or not it is equal to”, used in if statements (&

ttest command)

= (single equals sign) means make something equal to, used in generate & replace commands

See [U] manual p76, help u within stata or http://www.stata.com/manuals13/u.pdf for IF statements

count counts the number of observations in the data set, or the number satisfying a condition if

combined with an if statement. (See SIDM 1 and 2 of 4).

count if varname==1 // count the number of obs with varname equal to 1 (note double equals sign to check whether or not it is equal)

tab1 var1 if var2<12, miss // tabulate var1 when var2 is less than 12

count if var1<=12 & var1>=3 // count the number of obs with var1 between 3 and 12 inclusive (NB NOT if 3<=var1<=12, needs to be written in 2 parts)

count if var1==12 | var1==13 // count the number of obs with var1 equal to 12 OR 13 (double equals signs are needed here)

count if var1~=12 // count the number of obs with var1 not equal to 12

count if var1!=12 // count the number of obs with var1 not equal to 12 (alternative version)

count if var1==. // count the number of obs with missing values for var1 (missing coded as "." in standard way)

count if missing(var1)==1 // count the number of obs with missing values for var1 (missing coded as "." or as one of the other missing value codes)

histogram var2 if var1>=5, frequency // produces a histogram of var2 when var1 is greater than or equal to 5 or missing (missing values count as infinitely large values)

count if var1>5 // count the number of obs with var1 greater than 5 OR missing (since missing counts as a high number)

count if var1>=5 & var1!=. // count the number of obs with var1 greater than or equal to 5 and NOT missing

count if (var1>5 & var1!=.) & (var2==2 | var2==3) // criteria can be combined using brackets

count if var1>mdy(6,30,2011) // counts if the variable, var1, is after 30 June 2011 or is missing - mdy is a date function standing for month, day, year (American ordering) - assumes var1 is in Stata date format

count if var1>mdy(6,30,2011) & var1!=. // counts if the variable, var1, is after 30 June 2011 (and NOT missing) - assumes var1 is in Stata date format

count if var3=="White" // when the variable (var3) is a string variable, the text must be given in double quotes and must match exactly: same capitalisation, same blanks (e.g. "White", "White ", "white", "white " are all different)
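The treatment of missing values in if statements is easy to check on a tiny made-up data set. In this sketch the four values of var1 are illustrative only.

```stata
clear
input var1
3
5
8
.
end
count if var1>5                     // 2: the value 8 and the missing value
count if var1>5 & var1!=.           // 1: the value 8 only
count if var1>5 & !missing(var1)    // 1: same result, also safe with other missing codes
count if inrange(var1, 3, 5)        // 2: values between 3 and 5 inclusive
```

inrange() is a built-in function that gives a compact alternative to writing the two-part condition.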

5.3 IN statement to restrict command to only some numbered rows/ observations:

Add “in 1/5” to end of command, to include only 1st to 5th observations, or add “in 45/50” for

observations 45 to 50 for example. This is added to the end of the command before/without a

comma, as per the IF statement.

list patientkey registerdate createddate in 1/5 // shows first 5 observations

+---------------------------------------------+

| patien~y regi~rdate createddate |

|---------------------------------------------|

1. | 1085212 2011-03-31 2012-05-11 10:57:16 |

2. | 1097655 2005-03-24 2012-03-14 13:05:26 |

3. | 1097655 2005-03-24 2012-03-14 13:05:26 |

4. | 1028983 2005-07-11 |

5. | 1041004 2007-02-26 |

+---------------------------------------------+

5.4 Labelling variable names, creating value labels, labelling data sets

Within Stata, type help label. This gives comprehensive coverage of the subject, and is also relatively easy to

understand, with some examples and not too much extra/ complex material that you may not use.

However, I summarise here and this may well be sufficient. (See SIDM 1 of 4).

label var gender “Gender of patient”

This may be useful to you and colleagues, to remember what variable you are looking at. May also

be useful when producing graphs, since this automatically gives this as the axis label for some types

of graph, and the variable labels appear in other Stata output too.

tab gender – the new variable label appears in this table

Gender of |

patient | Freq. Percent Cum.

------------+-----------------------------------

1 | 541 28.73 28.73

2 | 1,342 71.27 100.00

------------+-----------------------------------

Total | 1,883 100.00

label define sexlbl 1 “Male” 2 “Female”

This defines a label, and names this new label sexlbl, with the value 2 being labelled as “female” and

the value 1 being labelled as “male”. This is convenient since we often need to store variables

numerically so that we can use them in analyses, and yet the data might naturally be categorical. It is

handy to use “value labels”, so that we don’t need to keep on looking up some external table

(which in worst case could be lost) to see what each number means.

label values gender sexlbl

This attaches the label (called sexlbl which we have just created) to the variable called gender. We

can attach this same label to many variables if we so choose. We can also create lots of new

variables. The value labels often appear in stata output in place of the numerical values that they are

attached to.

tab gender, miss – the new variable label and value labels (“Male”, “Female”) appear in the

output

Gender of |

patient | Freq. Percent Cum.

------------+-----------------------------------

Male | 541 28.73 28.73

Female | 1,342 71.27 100.00

------------+-----------------------------------

Total | 1,883 100.00

tab gender, miss nolabel – to tabulate numeric values, not their value labels

label list

sexlbl:

1 Male

2 Female

codebook gender also gives the variable and value labels for gender, so that you can see how they are

coded

5.5 Converting string variables to numeric data (destring) to enable analysis.

Useful when you have variables that look like numbers, but stata calls them “string” variables – use

“describe” command to check this out (first command in this document). (See SIDM 2 of 4).

destring cholesterol, gen(chol_mmol)

Stata error message: cholesterol contains nonnumeric characters; no generate

If you have one or a few characters amongst the numbers, then you will get above error message,

but you can still destring taking the number part and ignoring the characters. You need to consider

whether or not this is appropriate, or whether important information might be lost by doing this.

You might possibly want to capture the information in the letters in a separate variable.

destring cholesterol, gen(chol_mmol) ignore("dfg")

cholesterol: characters d f g removed; chol_mmol generated as numeric variable

(1874 missing values generated)

If you don’t feel that you can specify the characters to ignore, and want to destring anyway,

then you can use the force option, and just go ahead. Very important to check what information is

lost by doing this, and if you want this information to be incorporated into this variable, or into

another variable. This is different to the above “ignore” option, in that it sets to missing any cell

which contains both numbers and characters or only characters.

destring cholesterol, gen(chol_mmol1) force

cholesterol contains nonnumeric characters; chol_mmol1 generated as double

(1876 missing values generated)

Crucial step of checking what you have created:

browse cholesterol chol_mmol1 // to visually check the new variable

sum chol_mmol1 // to check range of values and for missing data

list cholesterol chol_mmol chol_mmol1 in 1/5 // this lists the first 5 observations

+--------------------------------+

| choles~l chol_m~l chol_m~1 |

|--------------------------------|

1. | 4.2 4.2 4.2 |

2. | 3.6 3.6 3.6 |

3. | 6.5 6.5 6.5 |

4. | 5.3dffg 5.3 . |

5. | 6.9fff 6.9 . |

+--------------------------------+

You would want to check more data than this to see if lost information is important. If the

information that you lose is important, then try “help function” in stata and selecting “string

functions”, and reading through the possible options to see if you can find a useful one.

To extract non-numerical information from a string variable, or other more specific numerical

information, search “function” and click on “string functions”. Read down the list of functions to find

an appropriate one. In addition, the two commands below may be useful:

split, separate – separates string variables into component parts
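As one possible approach, the regular expression functions regexm() and regexs() can keep the number and the trailing letters in separate variables. This is a sketch on made-up data; the pattern assumes the number comes first and any letters follow it, and the new variable names are hypothetical.

```stata
clear
input str10 cholesterol
"4.2"
"5.3dffg"
"6.9fff"
"high"
end
* group 1 = leading digits and decimal point, group 2 = whatever follows
gen chol_num = real(regexs(1)) if regexm(cholesterol, "^([0-9.]+)(.*)$")
gen str10 letters = regexs(2)  if regexm(cholesterol, "^([0-9.]+)(.*)$")
list
```

Rows that do not start with a number (here "high") are left missing on both new variables, so they still need checking by eye.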

5.6 Converting string into labelled categorical data (encode) & vice versa (decode)

Another example is where you have a string variable, for example “sex” with different categories

written as text/ letters, e.g. the following coding. Note that “F” and “f” are treated as being

different. Here there are 4 codings for female and 2 for male. (See SIDM 2 of 4).

tab sex

sex | Freq. Percent Cum.

------------+-----------------------------------

F | 1 5.56 5.56

Femal | 1 5.56 11.11

Female | 5 27.78 38.89

M | 4 22.22 61.11

Male | 4 22.22 83.33

f | 3 16.67 100.00

------------+-----------------------------------

Total | 18 100.00

encode sex, gen(sex2)

This generates a new numerical variable, sex2, from the string variable, sex. It decides upon a

numeric coding for each possible text response for sex. It also labels them (with value labels) so that,

although stored internally within stata as number, we again see them as corresponding text in our

output. To check what has happened, use the tab command and the label list command as follows:

Checking how new variable compares to old:

tab sex sex2, miss – including any with missing observations on either variable

with “,miss” option. This is important to include, to get the full picture.

| sex2

sex | F Femal Female M Male f . | Total

-----------+------------------------------------------------------------------+----------

| 0 0 0 0 0 0 1,865 | 1,865

F | 1 0 0 0 0 0 0 | 1

Femal | 0 1 0 0 0 0 0 | 1

Female | 0 0 5 0 0 0 0 | 5

M | 0 0 0 4 0 0 0 | 4

Male | 0 0 0 0 4 0 0 | 4

f | 0 0 0 0 0 3 0 | 3

-----------+------------------------------------------------------------------+----------

Total | 1 1 5 4 4 3 1,865 | 1,883

label list – to see what the numerical codes are underlying the value labels used for

sex2 variable (& for other variables too)

sex2:

1 F

2 Femal

3 Female

4 M

5 Male

6 f

sexlbl:

1 Male

2 Female

decode goes back from numeric data to string, i.e. it is the inverse of encode.

5.7 Tidying up strings before encoding; further string functions

If you want to tidy up your string variable before trying to encode it, then search for “functions”, click

on “string functions”, and look for ones which remove or limit spaces, and that convert to lower case

throughout and similar. (See SIDM 2 of 4).

As an alternative to encode (above) followed by recode (section 5.8), the following might be a simpler

way to obtain the same coding:

gen sex3=2 if sex=="Female" | sex=="Femal" | sex=="F" | sex=="f"

replace sex3=1 if sex=="Male" | sex=="M"

label values sex3 sexlbl

Crucial step of checking the results:

tab sex3 sex, miss // remember to check what you’ve just created, including right amount of

missing data (or browse sex3 sex is also useful)
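Tidying the string first reduces the number of spellings that need listing. A sketch on made-up data, assuming the string functions lower() and trim() and a hypothetical intermediate variable sexclean:

```stata
clear
input str8 sex
"F"
"Femal"
" Female"
"M"
"Male"
"f"
""
end
gen sexclean = lower(trim(sex))          // remove leading/trailing blanks, convert to lower case
gen sex3 = .
replace sex3 = 1 if inlist(sexclean, "m", "male")
replace sex3 = 2 if inlist(sexclean, "f", "femal", "female")
label define sexlbl 1 "Male" 2 "Female"
label values sex3 sexlbl
tab sex sex3, miss                       // check every original spelling against its new code
```

The empty string stays missing on sex3, which the final table makes visible.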

5.8 Recoding numeric & categorical variables

Recode is useful for changing/ amending values or ranges of values; help recode will give you code

to change values, ranges of values and missing data. (See SIDM 2 of 4).

This continues the above example.

recode sex2 (1=2) (3=2) (4=1) (5=1) (6=2), gen(sex3) – this is one way of

recoding values of sex2, and creating a new variable called sex3. This appropriately combines the 6

categories of sex2 variable into 2 categories. (1=2) indicates that value 1 on sex2 is recoded to value

2 on the new variable sex3, and so on. Value 2 (“Femal”) already has the right code, so it needs no rule.

(17 differences between sex2 and sex3)

label values sex3 sexlbl – this attaches value labels to the newly created variable sex3,

so that we can see which is male and which is female (as given in the following table).

tab sex3 sex, miss – as above essential to check how new variable is calculated, including

treatment of missing values with “,miss” option.

| RECODE of sex2

sex | Male Female . | Total

-----------+---------------------------------+----------

| 0 0 1,865 | 1,865

F | 0 1 0 | 1

Femal | 0 1 0 | 1

Female | 0 5 0 | 5

M | 4 0 0 | 4

Male | 4 0 0 | 4

f | 0 3 0 | 3

-----------+---------------------------------+----------

Total | 8 10 1,865 | 1,883

Crucial step of checking the results:

tab sex3 sex, miss // remember to check what you’ve just created, including right amount of

missing data (or browse sex3 sex is also useful)

5.9 Categorical variables from numerical data (e.g. quintiles, specified cut-offs)

It is useful to be able to create categories from continuous data. We might often want to use the

median, quartiles or quintiles to create these groups. We might alternatively want to use recognised

cut-offs which may have clinical significance to create the groups (e.g. for BMI, use recognised

thresholds to denote underweight, normal weight, overweight, obese). (See SIDM 2 of 4).

centile var1, centile(20, 40, 60, 80) // prints out the centiles of var1

pctile var3 = var1, n(5) // creates a new variable “var3” which contains 4 values, which are the

cut-offs for the quintiles

xtile varq5=var1, n(5) // creates a new variable “varq5” divided into 5 equal groups, so by quintile

xtile varq5=var1, cut(var3) // creates a new variable “varq5” divided into 5 groups, using as cut-offs

the values now stored in var3 (the quintile cut-offs created by the pctile command above)

xtile varq4=var1, n(4) // creates a new variable “varq4” divided into 4 equal groups, so into quarters

egen var2=cut(var1), at(0, 17, 23, 47, 51, 62) /* specifies numeric values in unit of measurement of

specified variable (not centiles) - here 0 is the minimum value, the next values are for the 20th, 40th

… percentile and 62 exceeds the max value so that all values within this range are non-missing on

var2 as well as on var1 */

Crucial step of checking whichever variables you’ve just created:

tab varq5 // whichever variable you’ve just created, remember to check it, to see how equal

quintiles are and to see whether there is the correct amount of missing data

browse var1 varq5 // another good way to check visually especially for small data sets

tabstat var1, stat(min max) by(varq5) // a thorough checking of the variable just created

The above all aim to divide the data into equally sized groups, but with small data sets or

when many people have the same value (for var1) then this might not be possible, so groups

can be very uneven in size.
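The effect of tied values can be seen with the auto example dataset shipped with Stata (an assumption, used here only for illustration): price has few ties, whereas rep78 takes only five distinct values.

```stata
sysuse auto, clear
xtile priceq5 = price, nquantiles(5)
tabstat price, stat(n min max) by(priceq5)   // groups of roughly equal size
xtile repq5 = rep78, nquantiles(5)
tab repq5, miss                              // heavily tied values give uneven groups
```

Observations with a missing value on the original variable are also missing on the new grouping variable.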

Note that both egen and xtile, with cut options, can be used to divide data into groups using cut-

points of your own choosing, and not necessarily aiming for equally sized groups (e.g. for BMI, using

clinically recognised cut-offs, or age into 10 year age bands).
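A sketch of grouping by chosen cut-points, using made-up BMI values. The cut-offs 18.5, 25 and 30 are the commonly quoted thresholds, and the upper limit of 100 is an assumption chosen simply to exceed any plausible value.

```stata
clear
input bmi
17
22
27.5
31
.
end
egen bmicat = cut(bmi), at(0 18.5 25 30 100) icodes   // codes 0, 1, 2, 3 for the four bands
label define bmilbl 0 "Underweight" 1 "Normal" 2 "Overweight" 3 "Obese"
label values bmicat bmilbl
tabstat bmi, stat(n min max) by(bmicat)   // check the range of bmi falling in each band
tab bmicat, miss                          // check the amount of missing data
```

Each band includes its lower limit and excludes its upper limit, so a BMI of exactly 25 falls in the overweight band.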

5.10 Creating a new variable counting the observation number, or repeat number (optionally observation within person or other grouping)

(See SIDM 3 of 4).

gen nnn=_n – this creates a variable which has the same value as the observation number

printed down LHS of data when browsed. You might want to sort appropriately first.

gen cnt=_N – this creates a variable which gives the total number of observations when it is

created, taking the same value for each observation.

sort patientkey time

by patientkey: gen obsno=_n – creates a variable called obsno which counts the

observation in each person (defined by variable patientkey) from the earliest time (obsno=1) to the

latest time for that person. Really useful for repeated measures data, after sorting by patientkey

and date.

browse patientkey time nnn cnt obsno // check what is created

tab obsno // shows how many patients have a 1st, 2nd, 3rd ... observation
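Combining by with _n and _N gives both the visit number and the number of visits per person. A sketch on made-up repeated measures data (the variable nobs is hypothetical):

```stata
clear
input patientkey time
101 1
101 2
101 3
102 1
103 1
103 2
end
sort patientkey time
by patientkey: gen obsno = _n      // 1, 2, 3 ... within each patient
by patientkey: gen nobs  = _N      // total number of rows for that patient
list, sepby(patientkey)
tab nobs if obsno==1               // one row per patient: how many patients have 1, 2, 3 ... rows
```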

6 Checking data for errors: the first essential part of any data analysis

Unless you do this step, then there is a high chance that your analyses will be meaningless. Some

example commands are given in brackets here as a guide, but details of syntax are explained later.

6.1 Looking out for missing data in stata

The summarize command will tell you how many values are not missing, for each variable (and count

and descr say how many observations there are in total in the dataset). summarize also tells you

the range of values. It is essential to look for outlying values, particularly to see if these are plausible.

Do you have any values such as 999, -1 or -99999, which might be missing data codes? If you don’t

spot these and recode to something that Stata will recognise as missing, then your results will be

wrong (e.g. replace var1=. if var1==999). (See SIDM 2 of 4).

e.g. if missing cholesterol is initially coded as 999, need to recode to missing as follows, so that stata

recognises it as missing data. Remember that . (dot) is Stata code for missing numerical data.

replace cholesterol=. if cholesterol==999 - so if missing cholesterol is initially

coded as 999, need to recode to missing value as recognised by stata with this command

count if cholesterol==. – counts how many observations have missing values for

cholesterol (i.e. coded to ., the missing value code recognised by stata).

count if cholesterol>4.5 –count observations where cholesterol is above this value OR

MISSING, since missing numbers count as plus infinity, i.e. bigger than all real numbers

count if cholesterol>4.5 & cholesterol!=. –count observations where

cholesterol is above this value and NOT missing

count if cholesterol<3 – count if cholesterol is below this value

count if GPpractice==”” – the missing value for string variables is given by “”, so this

counts number with missing data for this string variable GPpractice.

count if missing(GPpractice)==1 – alternative way to find the number missing (which

includes anything coded with other Stata missing codes as well as . – it is possible to distinguish between

different types of missing using different missing codes if you want)

count if missing(cholesterol)==1 – alternative way to find the number missing as

above.

[“help missing“ gives more detail on missing values in Stata]

misstable (options are to type misstable summarize, misstable nested, misstable tree or misstable

pattern) will tabulate missing data for you. Use stata help for details.

“help mvencode” gives a command that changes missing values to numeric and vice versa.

Are missing data getting in the way of you being able to create the mean of a few variables, or SD,

or other summary statistic? (where you want one mean value per patient, or averages across

patients). Then try help egen which has functions which give these summary measures, excluding

missing data.
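These tools can be combined as in the sketch below, which uses made-up repeated cholesterol values with 999 as the missing data code (variable names are hypothetical).

```stata
clear
input chol1 chol2 chol3
5.1 999 4.9
999 999 999
4.0 4.4 4.2
end
mvdecode chol1 chol2 chol3, mv(999)          // recode 999 to . in all three variables at once
misstable summarize                          // number of missing values per variable
egen cholmean = rowmean(chol1 chol2 chol3)   // mean of the non-missing values in each row
list
```

A row with no valid values at all is left missing on cholmean.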

6.2 Consider if other values need to be recoded to missing

Do you need to recode any other data as missing before beginning your analysis e.g. negative

numbers? Numbers that are implausibly high? Remember that having values for a few patients that

are much larger than for most patients is to be expected for some variables. Don’t omit values just

because they are outliers. Take clinical advice on what values are plausible, even if well outside the

normal range. It is rarely justified to present results without genuine outlying values. If you do, then

report that you have done so, and state what outlying values were omitted, and also give results

including them. Performing analyses with and without outlying values is a good check of the

robustness of the results.

6.3 Look out for inconsistencies in units

You might want to tabulate your data and/or draw a histogram, to check that the distribution looks

reasonable. Or there might be evidence that some values are in one unit and some are in another

unit (because of 2 separate overlying peaks on the histogram, bimodal distributions).

6.4 Look out for undefined category values

For categorical values, are all values plausible, or are some erroneous, i.e. they do not correspond to

any pre-defined category? If so, you may need to change these to missing or recode to the correct

category. tab var1 gives frequencies against the value labels. If it also gives frequencies for plain

numerical values (e.g. 11), this might indicate that they should be recoded to missing (e.g.

replace gender=. if gender==11), or to the correct category if known, unless you think that numerical

values without a label are valid for this variable (e.g. number of children coded as 0, 1, 2, 3 and then a

labelled category “4 or more”). It might also be useful to recode the data.

6.5 Missing dates: do you have many dates the same and you don’t know why?

Missing dates sometimes appear as a constant date, usually (and hopefully) outside the plausible range of genuine dates in your data set. The same can apply to other numerical variables.

6.6 Check for consistency between variables

Cross tabulate your variables (tab var1 var2), to check for inconsistencies. E.g. do you have any pregnant men? Do you have people who are apparently never-smokers, where number of cigarettes per day is 20? Which do you believe, or which do you choose to go with? You may want to recode for consistency, although it is important to note what is done here, and to also keep the original variables. (See SIDM 2 of 4).

Are dates at entry always before dates at visit 1, and are these always prior to dates at visit 2?
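These checks might look something like the following, on made-up data (the variable names male, pregnant, entry, visit1 and visit2 are assumptions). Missing values count as very large numbers in Stata, so they are excluded explicitly from the date comparisons:

```stata
clear
input patientkey male pregnant entry visit1 visit2
1 1 0 18000 18030 18060
2 0 1 18010 18005 18090
3 1 1 18020 18050 .
end
format entry visit1 visit2 %td

tab male pregnant                            // any pregnant men?
list patientkey if male == 1 & pregnant == 1

* Dates out of order, ignoring lines where either date is missing
count if entry > visit1 & !missing(entry, visit1)
count if visit1 > visit2 & !missing(visit1, visit2)
```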

6.7 Check inconsistencies and outlying values against the original data source

This is ideal, to correct any possible errors, but not always possible.

6.8 To check for duplicate people/observations in your data set

The command isid var1 is a simple way of determining whether or not var1’s values are unique (or isid var1 var2 to check if this combination of variables is unique). It is simpler than the explicit subscripting approach described in 9.10.

help duplicates gives details of a command that you can use for dealing with duplicates in your data set.
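A short sketch of both commands on made-up data (the variable names and the new variable dup are assumptions). isid stops with an error when the values are not unique, so it is wrapped in capture noisily here to let the rest run:

```stata
clear
input patientkey visit
1085212 1
1097655 1
1097655 1
1097655 2
end

capture noisily isid patientkey visit     // error message: lines 2 and 3 are identical
duplicates report patientkey visit        // how many copies of each combination
duplicates tag patientkey visit, gen(dup) // dup = number of other copies of this line
list if dup > 0

duplicates drop patientkey visit, force   // keeps the first of each set, use with care
isid patientkey visit                     // now runs without complaint
```

Look at the duplicated lines before dropping any of them, since they may differ in other variables.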

6.9 Check every new variable that you create

Checking is an essential part of creating a new variable, as is labelling it appropriately so that we know what it is, and storing the command that creates it in our do file of data commands.

7 Dates and times

These are usually read into stata originally as strings, but you need to convert to stata date or time

format so that stata can order them appropriately and use them in calculations:

7.1 Create a new date variable recognised by stata in calculations as date

That is, create a variable stored internally as a date, from the string variable registerdate. The string variable looks like dates when viewed by us, but stata treats it as purely a string variable: it would sort it only numero-alphabetically and would not use it in calculations. (See SIDM 2 of 4).


gen registerdate1=date(registerdate, "YMD")

“YMD” indicates Years then Month then Days are given currently in existing variable registerdate.

Change the order of these 3 letters to reflect the ordering of Day, Month, Year in your string

variable. Now list or browse to see what new variable looks like:

list patientkey registerdate registerdate1 in 1/3

     +----------------------------------+
     | patien~y   regi~rdate   regist~1 |
     |----------------------------------|
  1. |  1085212   2011-03-31      18717 |
  2. |  1097655   2005-03-24      16519 |
  3. |  1097655   2005-03-24      16519 |
     +----------------------------------+

The new “date variable” appears as numbers, which do not immediately mean anything to us. They are meaningful to stata, and represent the number of days from a defined starting date (1 January 1960).

Now that we have a variable that stata recognises as a date, we can nevertheless change its format,

i.e. the way that it is viewed by us, so that we can read it as dates again. Yet it is still stored internally

by stata as proper date format, so we can use it in calculations or sort by it. Now list or browse to

see what new variable looks like:

format registerdate1 %td – to change the way we view this variable

list patientkey registerdate registerdate1 in 1/3

     +-----------------------------------+
     | patien~y   regi~rdate   registe~1 |
     |-----------------------------------|
  1. |  1085212   2011-03-31   31mar2011 |
  2. |  1097655   2005-03-24   24mar2005 |
  3. |  1097655   2005-03-24   24mar2005 |
     +-----------------------------------+

The new “date variable” created above now appears as recognisable dates, and we need to visually inspect a few of them to check that they agree with the original string date variable.


gen registerdate1=date(registerdate, "DMY") – for the more usual way of writing dates in the UK as day, month, year, with 4 digit years.

gen date=date(date_str,"DM20Y") – for a common way of writing dates in the UK as day, month, year, with 2 digit years, when all dates fall in the years 2000 to 2099.

gen date=date(date_str,"DM19Y") – for the same layout with 2 digit years, when all dates fall in the years 1900 to 1999.

This tells you how to deal with a mixture of 2 and 4 digit years:

http://www.ats.ucla.edu/stat/stata/faq/date_year.htm
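One way of handling such a mixture is the optional third argument of the date function, which gives the latest year that a 2 digit year may be taken to mean. The sketch below uses made-up strings, and the cutoff of 2020 is an assumption that you would choose to suit your own data:

```stata
clear
input str10 date_str
"31/03/2011"
"24/03/05"
"15/06/98"
end

* 2 digit years become the latest matching year that is not after 2020;
* 4 digit years are read as written
gen date1 = date(date_str, "DMY", 2020)
format date1 %td
list

* Strings that cannot be read as dates become missing, so count them
count if missing(date1) & date_str != ""
```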

7.2 Create a date and time variable that stata recognises as such from a string variable

The variable createddate has time as well as date, and so we need to convert it to a variable which stata recognises internally as indicating date and time (using the “clock” function to get a new variable called createddate1). All the data is then stored as very large numbers (which indicate the number of milliseconds from midnight at the start of 1 January 1960). So that we can see them in an understandable format, use the “%tc” format, which changes the way they appear to us. (See SIDM 2 of 4).

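gen double createddate1=clock(createddate, "YMD hms")

where “YMD hms” refers to order of Years, Month, Days and of hours, minutes and seconds as written in the already existing string variable. Change order of these letters as necessary. Specify double as the storage type: clock values are too large to be held exactly in the default float type, and without double the stored times are rounded and can be wrong by a minute or so.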

format createddate1 %tc
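A small sketch of why the storage type matters for clock variables, using two made-up strings (the variable created_flt exists only for the comparison and is not something you would keep):

```stata
clear
input str19 createddate
"2012-05-11 10:57:16"
"2012-03-14 13:05:26"
end

gen double createddate1 = clock(createddate, "YMD hms")   // held exactly
gen float  created_flt  = clock(createddate, "YMD hms")   // rounded, for comparison only
format createddate1 created_flt %tc
list

* The double version gives back the time written in the string
display hh(createddate1[1]) ":" mm(createddate1[1]) ":" ss(createddate1[1])
```

In the listing, the float version typically shows times that are some seconds away from the original strings, while the double version agrees with them.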

7.3 Convert from clock format to standard date format

We are not usually interested in the time of a visit or similar so precisely that we need times as well as dates. It is easier to work with variables in date format for most purposes, so change from clock format to date format as follows, to create a new variable called createddate2:

gen createddate2=dofc(createddate1)

format createddate2 %td

list patientkey createddate createddate1 createddate2 in 1/3

     +-----------------------------------------------------------------+
     | patien~y           createddate         createddate1   created~2 |
     |-----------------------------------------------------------------|
  1. |  1085212   2012-05-11 10:57:16   11may2012 10:56:47   11may2012 |
  2. |  1097655   2012-03-14 13:05:26   14mar2012 13:04:25   14mar2012 |
  3. |  1097655   2012-03-14 13:05:26   14mar2012 13:04:25   14mar2012 |
     +-----------------------------------------------------------------+

The above shows our 2 newly created variables, with their new easy to read formats, so that we can check a few to see that stata has done what we intended it to do. They are nevertheless in proper stata date and clock formats, so have appropriate numerical values attached for use in sorting and in calculations. Note that the times shown in createddate1 differ from the original strings by up to a minute. This output comes from a clock variable stored as float, which cannot hold such large numbers exactly; creating the variable with gen double avoids the problem.

descr patientkey createddate createddate1 createddate2

              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------
patientkey      long    %12.0g                PatientKey
createddate     str19   %19s                  CreatedDate
createddate1    float   %tc
createddate2    float   %td

The stata clock and date variables are stored as standard number variables, and the display format indicates dates (%td) and clock formats (%tc). Both are shown here with the “float” storage type. That is fine for dates, but clock variables should be created as double (gen double) so that the times are held exactly.

7.4 Comparing a date variable with a date that we specify

count if createddate2<mdy(12,31,2011)

This counts the number of observations with createddate2 earlier than 31 December 2011.

The function mdy allows us to specify a value for month, day and then year (using this American

ordering – we can’t change the order of mdy here), so that we do not need to know the stata

numeric value corresponding to the date of interest. (See SIDM 2 of 4).

We can also use this function to construct dates from separate variables which give the month, day

and year as follows:

gen birthdate=mdy(birthmonth, birthday, birthyear)

7.5 Create a stata date variable from 3 separate variables for month, day and year

This is done by using variable names instead of numbers in the above mdy function, with the generate (“gen”) command, as in the birthdate example in 7.4. (See SIDM 2 of 4).
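A brief sketch on made-up data (the values are invented; the variable names follow the birthdate example above). It also shows that mdy gives a missing value when the three parts do not make a real date:

```stata
clear
input birthday birthmonth birthyear
31  3 1975
30  2 1980
15 12 1990
end

gen birthdate = mdy(birthmonth, birthday, birthyear)
format birthdate %td
list

* Impossible dates such as 30 February come out as missing
count if missing(birthdate)

* Compare against a fixed date without knowing its numeric value
count if birthdate < mdy(1, 1, 1980)
```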

7.6 Finding the time elapsed between 2 stata date variables

gen timelapsed_days = createddate2 - registerdate1

(869 missing values generated) – this will occur if either date variable is missing

This gives time between the 2 specified dates (subtracting one variable name from the other), in days.

gen timelapsed_yrs = (createddate2 - registerdate1)/365.25

(869 missing values generated)

This gives time between the 2 specified dates (subtracting one variable name from the other), in years.

summ timelapsed_days timelapsed_yrs

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
timelapse~ys |      1014    4495.933     4371.45         -1      23293
timelapse~rs |      1014    12.30919    11.96838  -.0027379   63.77276

list timelapsed_yrs timelapsed_days createddate2 registerdate1 in 1/3

     +---------------------------------------------+
     | timel~rs   timel~ys   created~2   registe~1 |
     |---------------------------------------------|
  1. | 1.114305        407   11may2012   31mar2011 |
  2. | 6.973306       2547   14mar2012   24mar2005 |
  3. | 6.973306       2547   14mar2012   24mar2005 |
     +---------------------------------------------+

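Visually inspect the above table to check if the results look reasonable. For instance, the minimum of -1 days in the summary above means that at least one record has a created date before its registration date, which is worth investigating. Always check visually and with one or two commands when you create a new variable, to check if the results look okay. This can save embarrassment later on.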

7.7 Extracting month or year from stata date variable and generating new variables

gen eventmonth=month(fulldateeventdate11) – new variable with just the month

gen eventyear=year(fulldateeventdate11) – new variable with the year

tab1 eventmonth – checking what it has done – also browse data to check further

tab1 eventyear – checking what it has done – also browse data to check further

gen eventmnths=eventyear*12+eventmonth – combining years and months by simple arithmetic to get new variable

tab eventmnths – remember to check again to see that new variable is what you really are wanting

There are other functions for dates and times, to find time elapsed, or to extract days, minutes, hours and similar. Within stata, type help functions and click on date and time functions. Read through the various options yourself.
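A few of these functions are sketched below on three made-up dates (the variable names are assumptions). mofd gives a monthly date, which is an alternative to the year*12+month arithmetic above:

```stata
clear
input eventdate
18717
16519
19124
end
format eventdate %td

gen eventday   = day(eventdate)      // day of the month
gen eventdow   = dow(eventdate)      // day of the week, 0 = Sunday
gen eventmonth = month(eventdate)
gen eventyear  = year(eventdate)

* Monthly date: number of months since January 1960
gen eventym = mofd(eventdate)
format eventym %tm
list
```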

8 Merging data sets and changing shape and size of data set

These are taught in more detail in “SIDM 3 of 4: Merging and reshaping Data and functions in

Stata” word document.

Use stata help to find details on these commands; help data management gives a more extensive list.

8.1 Combining datasets

merge – this is used to combine 2 data sets, when you primarily want to add variables to your data set. New lines of data might also be added in some cases.

append – this is used to combine 2 data sets, when you primarily want to add new observations onto the end of your data, i.e. add new lines of data at the bottom.
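A minimal sketch of both commands, building two small made-up files first (all names and values are assumptions). Here there are several visits per patient in memory and one line per patient in the saved file, so the merge is m:1:

```stata
clear
input patientkey age
1 34
2 51
3 47
end
tempfile demog
save `demog'

clear
input patientkey visit chol
1 1 5.2
1 2 4.9
2 1 6.1
4 1 5.5
end

* Add the age variable to every visit line for the matching patient
merge m:1 patientkey using `demog'
tab _merge            // 1 = in visits only, 2 = in demographics only, 3 = matched

* append stacks observations on the end instead of adding variables
drop _merge
append using `demog'
count
```

Always tabulate _merge after a merge, to check that the numbers matched and unmatched are what you expect.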

8.2 Reshaping datasets

reshape – this is when you have several lines of data per patient (or per family or similar), and you want a data set with just one line per patient/per family. If you have a different value of cholesterol on each line of data, then the new data set will have several “cholesterol” variables; the number of such variables is given by the maximum number of observations per patient/per family in the original data set. If a variable’s values differ within patients/families and you will not need it later on, then you might want to drop it before doing the reshape command. Alternatively, list ALL variables which are not constant within patient/family in the first part of the command to avoid an error message.

reshape – also does the opposite to the above. If you have just one line of data per patient, with repeated measurements of cholesterol (or other variable) named chol1, chol2, chol3, chol4 (or similar – any name with consecutive numbers on the end), then you can get a data set with several lines of data per patient, one line for each repeated value of cholesterol.

collapse – this is used when you have repeated lines of data for each patient/GP practice or similar, and you want just one line per patient or per GP practice. You want this line of data to contain means, min, max or other summary statistics (in the above example, it would not give several cholesterol variables for each patient; it would give just one, which could contain the mean, or possibly additional ones with the SD or sample size or whatever you choose).
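The three uses above might look like this on a small made-up data set (the names patientkey, visit and chol, and the new names chol_mean, chol_sd and chol_n, are assumptions):

```stata
clear
input patientkey visit chol
1 1 5.2
1 2 4.9
1 3 5.0
2 1 6.1
2 2 5.8
end

* Long to wide: one line per patient, with chol1 chol2 chol3
reshape wide chol, i(patientkey) j(visit)
list

* Wide back to long: one line per patient per visit
reshape long chol, i(patientkey) j(visit)
drop if missing(chol)         // patient 2 has no third visit

* One line per patient holding summary statistics
collapse (mean) chol_mean=chol (sd) chol_sd=chol (count) chol_n=chol, by(patientkey)
list
```

collapse replaces the data in memory, so save your data set first if you will need the individual lines again.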


8.3 Creating variables of summary statistics

The egen command, combined with by, will give summary statistics, as per the collapse command. However, it retains all the original lines of data in the data set, and adds a new variable containing the summary statistic, either summarised across all the data, or summarised by patient (by using by patient: egen ...).

If you are dealing with data with several lines of data per person (where the above commands may be helpful), then it might be worth reminding you of explicit subscripting (see 9.10), which is useful if you want to carry a person’s previous data value forward (with data in “long” format).
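A sketch of both ideas on made-up data in long format (the variable names are assumptions, apart from patient which follows the text above):

```stata
clear
input patient visit chol
1 1 5.2
1 2 .
1 3 5.0
2 1 6.1
2 2 5.8
end

* Mean per patient, repeated on every line for that patient
bysort patient: egen chol_mean = mean(chol)

* Overall mean, the same on every line
egen chol_allmean = mean(chol)

* Carry the previous value forward within patient (explicit subscripting)
bysort patient (visit): gen chol_locf = chol
by patient: replace chol_locf = chol_locf[_n-1] if missing(chol_locf) & _n > 1
list
```

All 5 lines of data are still there afterwards, unlike with collapse.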

http://www.stata.com/manuals13/d.pdf

[D] Data management Stata manual gives more information on these

commands, and lists in its contents other commands that might be useful

8.4 Managing files: comparing datasets and comparing variables

cf is a stata command that you can use to compare the data set in memory with a data set saved on disk (see help cf), so that you can see whether they have the same variables, and whether those variables take the same values. compare is a command that compares two variables.

9 Other Useful Information

9.1 To use stata as a calculator

Use the display or di command:

di "hello" /* this just prints again whatever is in the double quotes */

hello

di 4+5 /* this performs a sum and displays the result */

9
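A few more illustrative uses of display, including a display format and the date functions from section 7 (the particular sums are arbitrary examples):

```stata
display 4 + 5
display sqrt(16) / 2
display %6.2f 100 * 17 / 23          // shown with 2 decimal places

* Dates: the numeric value behind a date, and back again
display mdy(3, 31, 2011)             // 18717, as in the registerdate1 listing in 7.1
display %td 18717                    // 31mar2011
```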

9.2 Restricting commands to run on only a subset of the data

Use if or in statements at the end of the main part of the command, but before the comma (i.e. before any options). See 5.2 and 5.3 above for details of syntax.
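For illustration, here are some commands restricted with if and in, using the auto example data set that is installed with Stata (the choice of variables is arbitrary):

```stata
sysuse auto, clear

summarize mpg if foreign == 1, detail                  // if goes before the comma
list make mpg in 1/5                                   // in restricts to observation numbers
tab rep78 if price > 5000 & !missing(price), missing
count if mpg > 25 in 1/50                              // if and in can be combined
```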

9.3 Tabulating your data

The Stata YouTube channel has videos that teach Stata table commands. The first is the simplest, good if you are new to Stata. The last is the most complex. In Stata 13, links to these appear at the bottom of Stata help.

Tables and cross tabulations in Stata: https://www.youtube.com/watch?v=3WpMRtTNZsw

Descriptive statistics in Stata (including tabstat): https://www.youtube.com/watch?v=kKFbnEWwa2s

Combining cross tabulations and descriptive statistics in Stata: https://www.youtube.com/watch?v=Dzg6AMSt10w

See help misstable for tables on missing data.

9.4 To get results from your analysis in an easy to use format for producing tables

statsby collects statistics from a command across a by list. Typing:

statsby exp_list , by(varname): command

executes command for each group identified by varname, building a dataset of the associated values

from the expressions in exp_list. The resulting dataset replaces the current dataset, unless the

saving() option is supplied. varname can refer to a numeric or a string variable.

command defines the statistical command to be executed. Most Stata commands and user-written

programs can be used with statsby, as long as they follow standard Stata syntax and allow the if

qualifier. The by prefix cannot be part of command.
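As a hedged sketch of the syntax above, using the auto example data set installed with Stata (the new variable names mean_mpg, sd_mpg and n are assumptions). The clear option lets the results replace the data in memory:

```stata
sysuse auto, clear

* One line per group, holding the mean, SD and N of mpg from summarize
statsby mean_mpg=r(mean) sd_mpg=r(sd) n=r(N), by(foreign) clear: summarize mpg
list

* Regression coefficients by group: _b collects all the coefficients
sysuse auto, clear
statsby _b, by(foreign) clear: regress mpg weight
list
```

The resulting small data set can then be exported or copied into a table.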

Look again at the start of this document for tips on transferring into excel and working with excel.

9.5 Summary statistics giving incidence rates and SMR and corresponding results for case-control studies

stptime – calculates person-time, incidence rates, and SMR

ir – gives incidence rates from cohort study data with variable f/u times

cs – gives cumulative incidence from cohort study data where all have the same f/u time.

See help epitab for relevant commands for case-control studies, including for matched case-control

studies (generally reporting odds and odds ratios).

9.6 Generating random numbers

gen randvar=runiform() – this generates a random number from the “uniform” distribution, from 0 to 1, so around 50% of observations are between 0 and 0.5, and the remainder are between 0.5 and 1. Around 20% are between 0 and 0.2, another 20% are between 0.2 and 0.4, and so on up to the 20% between 0.8 and 1.

In stata, type help functions, and click on “random-number functions”. You can generate random numbers from many distributions. This may be useful if you are simulating data for any reason. It is very useful to check that you get similar results when you rerun your analysis on a second set of simulated data. However, it is nevertheless useful sometimes to make sure that you always get the same results from your simulations, by making sure the random numbers generated are always the same. Stata has an internal random number list. To always start at the same point in this list, use the set seed command, with an arbitrary number that you can choose. Also save commands in a do file, so that you can repeat them exactly.

set seed 3214


gen randvar=runiform()

Other relevant commands to look up in stata help:

corr2data – to create data with specified correlation structure

drawnorm – to draw data from the multivariate normal distribution

sample – to draw a random sample of the data in memory

set obs 1000 – this may also be relevant here, to increase the number of observations in your

data set to 1000 (or specify another value)
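Putting some of these together, a small simulation might start like this (the variable names z and treat, and the sizes chosen, are assumptions for illustration):

```stata
clear
set obs 1000                     // create 1000 empty observations
set seed 3214                    // same seed gives the same random numbers on every run

gen randvar = runiform()         // uniform on 0 to 1
gen z       = rnormal()          // standard normal
gen treat   = runiform() < 0.5   // random 0/1 allocation, about half in each group

summarize randvar z
tab treat

sample 10                        // keep a random 10% of the observations in memory
count
```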

9.7 To get p-values from probability distributions & vice versa

You need knowledge of the probability distribution. You need to take account of whether you want one or two sided p-values, of which side is relevant, and of whether two-sided is ever relevant for your specified distribution. In stata, type help functions, and click on “density function” for a list of options here for use with the generate (“gen”) command. Usually p-values are given in stata output, so there might not be any need to do this yourself with these commands.
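Some examples of these functions, used here with display rather than gen (the test statistics and degrees of freedom are arbitrary numbers chosen for illustration):

```stata
* Two-sided p-value for a z statistic of 1.96 (standard normal)
display 2*normal(-abs(1.96))

* Two-sided p-value for t = 2.1 on 20 degrees of freedom
display 2*ttail(20, abs(2.1))

* Upper-tail p-value for a chi-squared statistic of 3.84 on 1 degree of freedom
display chi2tail(1, 3.84)

* And vice versa: critical values from probabilities
display invnormal(0.975)
display invttail(20, 0.025)
display invchi2tail(1, 0.05)
```

The first and fourth lines are a pair: a z statistic of 1.96 gives a two-sided p-value very close to 0.05, and a probability of 0.975 gives back a value very close to 1.96.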

9.8 Accessing saved results of commands

Many commands save some of the results produced. To access these, type either return list or ereturn list, and hopefully at least one of these will contain results. Not all commands have saved results. If they do, then results are available until the next Stata command is run that also saves results. See SIDM 4 of 4 for guidance on how to use these, for instance within loops that extract results of many commands. If you want to save summary statistics, then the egen command may be a simpler option.
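A short sketch using the auto example data set installed with Stata (the new variable mpg_centred is an assumption). summarize leaves its results in r(), and estimation commands such as regress leave theirs in e():

```stata
sysuse auto, clear

summarize mpg
return list                       // r() results left behind by summarize
display "Mean mpg = " r(mean)
gen mpg_centred = mpg - r(mean)   // use a saved result straight away

regress mpg weight
ereturn list                      // e() results from estimation commands
display "R-squared = " e(r2)
display "Slope for weight = " _b[weight]
```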

9.9 Loops in Stata

Using loops is a way of avoiding writing lots of repetitive code, when doing the same thing on many different variables, or for various different numbers, using foreach or forvalues loops. See SIDM 4 of 4 for guidance on how to use these.
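As a minimal sketch of each type of loop, again using the auto example data set (the new variable names mpg_pow1 to mpg_pow3 are assumptions). Loops need to be run from a do file rather than typed one line at a time:

```stata
sysuse auto, clear

* foreach: the same commands for each variable in a list
foreach v of varlist mpg weight length {
    quietly summarize `v'
    display "`v': mean = " %9.2f r(mean) ", N = " r(N)
}

* forvalues: the same commands for a range of numbers
forvalues i = 1/3 {
    gen mpg_pow`i' = mpg^`i'
    label variable mpg_pow`i' "mpg to the power `i'"
}
describe mpg_pow*
```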

9.10 Explicit subscripting: check for duplicates, adding or subtracting data from different rows of your dataset

See 6.8 for an easier way to check for duplicates. Explicit subscripting is more versatile, so it is given here in case it is useful to you. You need a unique identifier to denote person (here the variable name used is var1, which might be patientkey in your data). Sort by this identifier, then count any times when the same identifier appears twice in a row:

sort var1

count if var1==var1[_n-1] // after sorting, will count duplicate values of var1

list if var1==var1[_n-1] | var1==var1[_n+1] // after sorting, will list all variables in lines of data which have the same var1 as another line of data.

The above commands are examples using “explicit subscripting”. The first can also be written “count if var1[_n]==var1[_n-1]” – where [_n] denotes the nth line of data in your data set, so this compares var1 for the nth line of data with var1 on the (n-1)th line of data, i.e. with the row before. This comparison is made for all n, from 1 to N, where N=number of observations in your data set.

Compare this to a more common command which uses “implicit subscripting”, which means we don’t need to give the subscripting, e.g. “count if sex==gender”, which is equivalent to “count if sex[_n]==gender[_n]”, which compares the nth line of data for sex with the nth line of data for gender, for n=1,2,3… N. In other words, for each line of data, it looks to see if sex and gender are the same.

help duplicates gives details of a command that you can use for dealing with duplicates in your data set.

The collapse command might be useful to combine data from more than one row, where the rows represent the same person.
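Explicit subscripting is often combined with by, so that _n and _N count within each person. A sketch on made-up data (all variable names are assumptions):

```stata
clear
input patientkey visit weight
1 1 70
1 2 72
1 3 71
2 1 85
2 2 .
end

sort patientkey visit
count if patientkey == patientkey[_n-1]     // lines that repeat the previous patientkey

* Within each patient, _n restarts at 1 and _N is that patient's number of lines
bysort patientkey (visit): gen visitnum = _n
by patientkey: gen nvisits   = _N
by patientkey: gen wt_change = weight - weight[_n-1]   // missing on each patient's first line
by patientkey: gen firstwt   = weight[1]
list, sepby(patientkey)
```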

9.11 Using Stata commands from the internet that are not pre-installed on Stata

See “How can I use the findit command to search for programs and get additional help?”

You might initially search for a command that you think might exist, or for the name of a statistical

procedure or similar, if you don’t know the command name that you are looking for.

10 Further resources for learning Stata

10.1 Excel spread sheet of commands

This is designed as a quick reference, to remind you of the basics, without the need to look through detailed resources.

10.2 Interactive workshops on the material in this Stata data management guide

To accompany this guide, there is a series of 4 workshops, called “Stata Introduction and Data Management”, which have interactive exercises on this material. This is sometimes taught face to face. This material is available on blackboard, in the introduction to statistical thinking and data analysis module of the MPH/MSc Epi course. Imperial staff/PhD students can sign up to gain access to this material.

10.3 Interactive workshops to teach Statistical Analysis

This material is available on blackboard, in the introduction to statistical thinking and data analysis module of the MPH/MSc Epi course. Imperial staff/PhD students can sign up to gain access to this material.

10.4 Stata teaching videos

http://www.youtube.com/user/statacorp

You can also search youtube for videos that other people have uploaded.


10.5 Stata help

Great for learning how specific commands work – look particularly at the options for any that might be relevant and at the examples near the end of the help file.

help data management gives a list of many commands that may be useful to you.

Remember help function is very useful.

10.6 Stata Manuals

These are an excellent resource, not only for learning Stata but also for learning Statistics. They are available online and also as pdfs accessible from within Stata help. Within help files, links to manuals are written [GSW], [D], [U] and similar. [GSW] is the most basic of these manuals for beginners.

10.7 Stata search

This searches for stata resources on the internet as well as within stata, which is good for finding suitable commands and information when you do not know command names. You can potentially find Stata commands that are available on the web, but that are not pre-installed with Stata. It is not wise to rely too much on googling and stata searches, rather than manuals or this document or books, for data management.

10.8 UCLA website

A very good resource, very useful for learning to do specific analyses using Stata. Also excellent for teaching Statistics. http://www.ats.ucla.edu/stat/stata/

They have a starter kit for people who are new to Stata.

http://www.ats.ucla.edu/stat/stata/sk/default.htm

10.9 Stata Programs for Teaching Statistics

This page describes Stata programs for teaching. You can download any of these programs from within Stata using the findit command. For example, to download the cidemo command you can type findit cidemo (see “How can I use the findit command to search for programs and get additional help?” for more information about using findit).

10.10 Resource if you want to create your own commands

This is advanced stuff and takes quite a lot of learning. Most people never do this.

http://www.ats.ucla.edu/stat/stata/ado/

10.11 Books for learning Stata

Reading books is an excellent way to learn stata. There are many good books, and you will find a few in the library.

http://www.stata.com/bookstore/books-on-stata/

10.12 Internet searches

Searching the internet can produce lots of useful information. However, it is not effective for learning strategies of data management. You might not know what to search for. If this approach does not reveal your answer quickly, I suggest you use an alternative approach, such as a book or guide, Stata manuals or UCLA website.