Stata Andrew


String functions

Search for a word in a string: 

gen x = regexm(var1,"RSV")     codes 1 if var1 contains the text "RSV", 0 if not

gen x1 = regexs(1) if regexm(name, "([a-zA-Z]+)[ ]*([a-zA-Z]+)")     extracts 1st word from string "name"

gen x2 = regexs(2) if regexm(name, "([a-zA-Z]+)[ ]*([a-zA-Z]+)")     extracts 2nd word from string "name" (regexs(2) is the 2nd bracketed group, so the pattern must not be wrapped in an extra pair of brackets)

gen x = strpos(var1,"Dr") > 0     codes 1 if "Dr" appears in variable, 0 if not

display strpos("stata","a")     returns 3 (string position of the first letter "a")

For tips on regexm code: http://www.ats.ucla.edu/stat/stata/faq/regex.htm : how to extract complex numbers or letters from string

 

Extract dates from strings

regexs(0) if regexm(date,"[0-9]+$")     extracts year from a date held as a string (eg 23jan2009)

regexs(0) if regexm(date,"[a-zA-Z]+")     extracts month from 23jan2009

regexs(0) if regexm(date,"^[0-9]+")     extracts day from 23jan2009
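As a worked illustration of the three date patterns, here is a minimal sketch on a two-row dataset. The variable names (sdate, sday, smonth, syear, d) are made up for the example, and date() is used at the end only as a cross-check.

```stata
* Hypothetical example data: dates held as strings such as 23jan2009
clear
input str9 sdate
"23jan2009"
"01feb2010"
end

* regexs(0) returns the whole text matched by the preceding regexm()
gen sday   = regexs(0) if regexm(sdate, "^[0-9]+")
gen smonth = regexs(0) if regexm(sdate, "[a-zA-Z]+")
gen syear  = regexs(0) if regexm(sdate, "[0-9]+$")

* Cross-check: convert the whole string to a numeric %td date
gen d = date(sdate, "DMY")
format d %td
list
```

The extracted pieces are strings; use real(syear) or destring if numbers are needed.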

 

Split a string into many variables according to a defined split point, eg: "Diagnosis:cardiac" to "Diagnosis" "cardiac"

split var1, p(:) limit(2)       splits string "var1" at each colon ":" into new variables var11, var12 (at most 2 new variables, set by "limit")

 

Extract letters from string

substr(var1,1,1)  chooses first letter of string

substr(var1,1,2)  chooses first 2 letters of string

substr(var1,-1,1) chooses last letter of string

substr(name,1,strpos(name,",")-1)  extracts from start of name up to (not including) the first comma

substr("abcdef",2,3) gives "bcd"

substr("abcdef",-3,2)  gives "de"

substr("abcdef",2,.)  gives "bcdef"

substr("abcdef",-3,.)  gives "def"
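The substr() and strpos() calls can be combined to cut a name at its first comma. This is only a sketch on invented data; the variable names (name, comma, surname, first3, last1) are assumptions for the example.

```stata
* Hypothetical example data in "Surname, Firstname" form
clear
input str30 name
"Smith, John"
"Jones, Mary Ann"
"Cher"
end

gen comma   = strpos(name, ",")                  // 0 when there is no comma
gen surname = substr(name, 1, comma - 1) if comma > 0
replace surname = name if comma == 0             // no comma: keep the whole string
gen first3  = substr(name, 1, 3)                 // first three characters
gen last1   = substr(name, -1, 1)                // last character
list
```

Guarding on comma > 0 matters because substr(name, 1, -1) would return an empty string for names without a comma.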

 

Isolate parts of string

gen x = strpos(diagnosis,"RSV") >0  codes 1 for any time the string "RSV" appears in the field (if it appears at least once it will be 1)

 regexm(var1,"RSV")  codes 1 if word "RSV" appears in var1 string 

 

Join strings together (concatenate; concat() is a built-in egen function)

egen initials = concat(x1 x2)

egen newvar = concat(x y z), punct(-)     creates x-y-z; put any punctuation character in the brackets of punct()

Convert string to number

encode ethnic_str, gen(ethnic)     codes each distinct string as 1, 2, 3... and keeps the text as value labels

Extract date (12/03/2009) from a long date format (12/03/2009 00:00:00)

gen x = substr(date,1,10)

Replace

replace myvar = myvar[_n-1] if missing(myvar)     replaces missing value with previous value

replace x = round(x,0.1)     rounds a value to 1 decimal place (round(x,1) rounds to the nearest whole number)

replace x = 1 if _n <= 100     replaces first 100 values

replace x = 1 if (_N - _n) < 100....replace last 100 values 

replace x = 1 in 4 ....replaces 4th value in x with 1

replace x=1 in 5/25  ....replaces from value 5 to 25
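A small sketch of the "replace missing value with previous value" idiom, on invented data (id and myvar are example names). Because replace works down the dataset in order, a run of missing values is filled from the last non-missing value.

```stata
* Hypothetical example data with gaps in myvar
clear
input id myvar
1 5
2 .
3 .
4 8
5 .
end

sort id                                           // the fill depends on the sort order
replace myvar = myvar[_n-1] if missing(myvar)     // carry the previous value forward
list
```

If the first observation is missing it stays missing, since there is no previous value to copy.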

 

Stats: Descriptive

summarize var1     gives n, mean, SD, min and max

summarize var1, detail     gives full breakdown of stats (adds percentiles, variance, skewness, kurtosis)

values are stored and can be retrieved afterwards, eg r(mean), r(p75), r(p25), r(sd); SEM = r(sd)/sqrt(r(N))   (r(p25) and r(p75) need the detail option)

to see the stored results type: return list
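For example, the SEM can be computed from the stored results straight after summarize. This sketch uses the auto dataset that ships with Stata; the scalar name sem is just an example name.

```stata
* Example on Stata's bundled auto dataset
sysuse auto, clear
quietly summarize mpg
scalar sem = r(sd)/sqrt(r(N))                 // standard error of the mean
display "mean = " r(mean) "   SEM = " sem
return list                                   // shows everything summarize left in r()
```

The r() results are overwritten by the next r-class command, so copy anything needed into a scalar or local first.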

statsby mean=r(mean) sem=(r(sd)/sqrt(r(N))) size=r(N), by(group) clear: summarize var1     gives grouped stats (replaces the data in memory with the results)

Stats for all groups of categorical variable

table x, contents(freq mean var1 sd var1)

table x, contents(freq median var1 p25 var1 p75 var1)     stats on categorical variable x including quartiles

table x y, contents(freq mean var1)     stats on two categorical variables (x and y)

tabi a b \ c d, column chi2 exact     quick way to get categorical stats from the four cell counts a, b, c, d

table slow class ethnic, contents(mean days sd days) by(gender) format(%4.1f)

Stats on multiple variables grouped using the by() option

tabstat age weight height, statistics(n mean sd) by(sex) format(%4.1f)

tabstat x, by(group) stat(n)

tabstat x if y==1, by(year) stats (median min max)

How to make stats per quartile of a variable

xtile agegrp = age, nq(4)     agegrp = 1 to 4 for the quartiles of age

tabstat age, statistics(n mean sd) by(agegrp)
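A runnable sketch of stats per quartile, using price from the bundled auto dataset in place of age. The name price_q is an assumption for the example.

```stata
* Example on Stata's bundled auto dataset
sysuse auto, clear
xtile price_q = price, nq(4)                              // 1 = lowest quartile, 4 = highest
tabstat price, statistics(n mean sd min max) by(price_q)  // min and max show the range of each quartile
```

Adding min and max to the statistics list is a quick way to see the cut points that xtile chose.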

Populate a new variable with stats by a grouped categorical variable (sort by the group first, or use bysort)

by group: egen var_new = median(y)

by id: egen median_weight = median(weight)

bysort yr: egen x = rank(lov)

by group: egen var_new = count(y)

by group: egen var_new = sum(y)

by group: egen var_new = pctile(y), p(75)     ie 3rd quartile

bysort yr: egen x = mean(y)

bysort yr: gen z = x - y if x==3 & y==4     use gen, not egen, for a plain expression; the if condition can only refer to existing variables

bysort group: tabstat los, by(sex) statistics(count mean)

by yr: gen ranked_set = _n     sequential case number by group (eg year); _n needs gen, not egen

 

Make a new dataset of descriptive stats using collapse command

collapse (p25) x, by(group)

collapse (mean) x y

statistics that can be used include mean, median, p50, p75, p25, sd, semean, sum, count, rawsum, min, max, first, last
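A short sketch of collapse with several statistics at once, again on the bundled auto dataset. The new variable names (mean_mpg, sd_mpg, n, sem) are example names only.

```stata
* Example on Stata's bundled auto dataset
sysuse auto, clear
collapse (mean) mean_mpg=mpg (sd) sd_mpg=mpg (count) n=mpg, by(foreign)
gen sem = sd_mpg/sqrt(n)          // standard error of the mean per group
list
```

collapse replaces the data in memory, so save the original file first (or wrap the step in preserve and restore).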

Cumulative sum by group

by group: gen tot = sum(x)

Highest value record (needs egenmore.ado)

egen high = record(wage), by(group) order(yr)

Count distinct values in a variable

bysort x y: generate count = (_n==1)

by x: replace count = sum(count)

by x: replace count = count[_N]

Other method:

egen tag = tag(x y)

egen count = total(tag), by(x)

or (needs egenmore.ado)

egen count = nvals(x), by(y)
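The tag() approach can be tried on a tiny invented dataset of repeat admissions. The names patient, visit, first, n_patients and n_visits are assumptions for this sketch.

```stata
* Hypothetical example data: one row per admission
clear
input str1 patient visit
"a" 1
"a" 2
"b" 1
"c" 1
"c" 2
"c" 3
end

egen first = tag(patient)               // 1 on one row per distinct patient, 0 elsewhere
egen n_patients = total(first)          // number of distinct patients
bysort patient: gen n_visits = _N       // number of rows for each patient
list, sepby(patient)
```

This uses only official egen functions, so nothing needs to be installed.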

Stats: Basic stuff

Basic stats:                                      good reference: http://www.ats.ucla.edu/stat/stata/whatstat/whatstat.htm

summarize x

summarize x, detail     gives details such as percentiles

Stats by group:

bysort yr: tabstat bedays, by(mo) statistics(median p25 p75)

table mo yr, contents(median bedays)

table x, contents(mean variableX sd variableX count variableX)

GENERATE median value for each SUBGROUP

by subgrp, sort: egen medstay = median(los)

GENERATE the deviation from the median length of stay

generate deltalos = los - medstay
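The two steps above can be checked on a few invented rows. The data values here are made up; only the commands come from the notes.

```stata
* Hypothetical example data: length of stay (los) in two subgroups
clear
input subgrp los
1 2
1 4
1 9
2 1
2 3
end

by subgrp, sort: egen medstay = median(los)   // same median repeated on every row of the subgroup
generate deltalos = los - medstay             // deviation from the subgroup median
list, sepby(subgrp)
```

Subgroup 1 has median 4 and subgroup 2 has median 2, so the stay of 9 days gets a deviation of 5.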

SUBGROUP AGGREGATES (eg by month)

by mnth, sort: egen monthmedn= median(daymax)

by mnth, sort: egen monthmax= max(daymax)

by mnth, sort: egen monthmin= min(daymax)

by mnth, sort: egen month25= pctile(daymax),p(25)

by mnth, sort: egen month75= pctile(daymax),p(75)

 Frequency distributions

tab x y

tab x y, row         gives % for row  (can also use column)

tab x y, chi2        for chi squared (use exact for Fisher's)

Fisher's or chi squared

use cci command:

cci 12 23 24 56, exact     will give this output:

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        12          23  |         35       0.3429
        Controls |        24          56  |         80       0.3000
-----------------+------------------------+------------------------
           Total |        36          79  |        115       0.3130
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.217391       |    .4710021     3.04953 (exact)
 Attr. frac. ex. |         .1785714       |   -1.123133    .6720806 (exact)
 Attr. frac. pop |         .0612245       |
                 +-------------------------------------------------
                                  1-sided Fisher's exact P = 0.4023
                                  2-sided Fisher's exact P = 0.6669

tab x y, chi2 nofreq column row     nofreq does not show frequencies; use exact for Fisher's

tabi 54 43 \ 56 78, column chi2     chi2 from the raw cell counts

tab1 x y z     produces one-way frequencies for multiple variables

tab1 varx-varz

by group, sort: tab x y, nofreq col chi2

tab3way is a good ado file for multi-column tables

Odds ratio

cci 21 16 1 4, exact

exactcci 21 16 1 4, exact

 

 Diagnostic tests (sensitivity, specificity, NPV, PPV) use module "diagt"

diagti 80 17 11 44

Showing means and descriptive stats in tables

tab x, summ(y)     shows basic stats (mean sd freq) of y for groups in x

tab x1 x2, summ(y) means     two-way table of x1 and x2 with means of y

table x1 x2, contents(mean y1 median y2)

in tabstat, stats(q) gives the quartiles (p25 p50 p75)

Options for the tab command:

, cell     gives % for each cell

, expected     gives expected frequencies

, generate(new)     creates dummy variables, eg tab x1, gen(dummy) produces dummy1 dummy2 dummy3 for each group of x1

, lrchi2     likelihood-ratio chi squared

, missing

, nofreq

, nolabel

Multi-table frequencies

table y x2 x3, by(x5 x6) contents(freq)


by x3, sort: tab x1 x2, exact  (This is a 3 way table)

 

other stats for contents():   freq, mean, sd, sum, rawsum, count, n, max, min, median, iqr, p1, p99, p75; use format() to set the display format

table x y, contents(mean z median z)

tab died yr, summ(pim) means     two-way table of means

table x, contents(mean z)

 

Table function with contents

tab group cs, sum(los) means

table group cs, contents(count los median los)

table inhouse yr, contents(count los median los)

table band2 yr, contents(count los)

table age yr, contents(count los)

table x y, by(z)

 

Confidence intervals

ci x, level(99)     produces 99% confidence interval for the mean of x (in Stata 14 and later: ci means x, level(99))

TABSTAT

tabstat x, stats(….)

tabstat pop, stat(mean) by(size)

tabstat lov_days, by(yr) stat(mean sd min max) nototal long

tabstat lov_days, by(yr) stat(n q) nototal long

tabstat x, stat(mean count median) by(var2)

tabstat x, stat(count mean q) by(y)     q gives the quartiles (p25 p50 p75)

by x: tabstat y if z==2 & q==0, stats(n mean q)

record(x) is highest value of x (egenmore function)

 

 

these are the statistics allowed between the brackets of stats():

·         mean mean
·         count count of nonmissing observations
·         n same as count
·         sum sum
·         max maximum
·         min minimum
·         range range = max - min
·         sd standard deviation
·         semean standard error of mean = sd/sqrt(n)
·         skewness skewness
·         kurtosis kurtosis
·         median median (same as p50)
·         p1 1st percentile
·         p5 5th percentile
·         p10 10th percentile
·         p25 25th percentile
·         p50 50th percentile (same as median)
·         p75 75th percentile
·         p90 90th percentile
·         p95 95th percentile
·         p99 99th percentile
·         iqr interquartile range = p75 - p25
·         q equivalent to specifying "p25 p50 p75"

 

OTHER BASIC STATS TESTS

 

Skewness test

sktest x

swilk x     (Shapiro-Wilk)

ladder x     (tries power transformations of x with a test for normality of each)

gladder x     (plots histograms of the transformed x)

Parametric one-sample t-test:

ttest write=50     (does mean differ from 50)

Non-parametric one-sample (Wilcoxon signed-rank test):

signrank write=50     (eg. does median differ from 50)

Binomial test

bitest female=.5     (eg. does proportion differ from 50%)

Parametric two independent samples t-test

ttest x, by(group)

Non-parametric Mann-Whitney test

ranksum x, by(group)

Parametric paired t-test

ttest x = y

Non-parametric paired (Wilcoxon)

signrank x=y

Parametric one-way anova: anova x y

Non-parametric Kruskal-Wallis: kwallis x, by(y)

Date stuff

Import dates from excel in dd/mm/yyyy hh:mm format (eg 12/03/2009 12:33)

gen double dt = clock(datevariable,"DMYhm")

format dt %tc

label variable dt "Date"

To convert clock date (dd/mm/yyyy hh:mm) to dd/mm/yyyy (eg 12/03/2001)

gen new_date = dofc(dt)

format new_date %td

To subtract dates with result in hours: (dates in %tc format, since hours() expects milliseconds)

gen double los_hr = hours(end_dt - start_dt)

gen los_days = los_hr/24

format los_days %9.1f
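A worked sketch of the length-of-stay calculation from two string date-times. The variable names adm and dis and the example timestamps are invented; the clock values must be stored as doubles.

```stata
* Hypothetical example data: admission and discharge as dd/mm/yyyy hh:mm strings
clear
input str16 adm str16 dis
"12/03/2009 12:33" "14/03/2009 18:33"
end

gen double start_dt = clock(adm, "DMYhm")     // milliseconds since 01jan1960
gen double end_dt   = clock(dis, "DMYhm")
format start_dt end_dt %tc

gen double los_hr = hours(end_dt - start_dt)  // hours() converts milliseconds to hours
gen los_days = los_hr/24
format los_days %9.1f
list
```

Here the stay is 2 days and 6 hours, so los_hr is 54 and los_days is 2.25 (shown as 2.3 with the %9.1f format).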

To generate dd/mm/yyyy date:

gen birth = date(dob,"DMY")

format birth %td

label variable birth "Date of Birth"

To convert STRING date in long  format (12/03/2009 00:00:00) to short format (12/03/2009)

gen x=substr(date,1,10) ....this keeps first 10 characters of string

then use date(x,"DMY")  to format the date as td format

Script to convert clock date to multiple formats

gen before = cond(hired_on < td(15jun2004), 1, 0) if hired_on < .

drop if admitted_on < tc(15jun2004 12:00:00)

gen date_tc = clock(x,"DMYhm")       // format structure is 12/03/2006 12:30


gen date_td =dofc(date_tc)               //convert to 12/03/2006   DAY/MONTH/YEAR format

gen date_tm=mofd(date_td)              //convert to 2006/03  YR/MONTH format

format date_tc %tc

format date_td %td

format date_tm %tm

label variable date_tc "Date Clock"

label variable date_td "Date"

label variable date_tm "Yr-Month"

gen yr=year(date_td)         //year

gen mo=month(date_td)    // month

gen day=day(date_td)       //day

gen doy=doy(date_td)      //day of year

Date stuff (stata 11)


Dates are set from jan 1 , 1960

 

Format Description

%td daily

%tw weekly

%tm monthly

%tq quarterly

%th half-yearly

%ty yearly

 

generate y = date(doa, "DMY")

format y %td

Date field in "12/03/2009 23:34" format, use clock function (1 unit = 1 millisecond)

gen double dt_clock = clock(datevariable,"DMY hm")

To convert clock format to date format use dofc

gen newdate = dofc(dt_clock)     ie 12/03/2009; then format newdate %td (format dt_clock %tc shows hh:mm as well)

Date in "12/03/2009" format, use date function (1 unit = 1 day)

gen y = date(x,"DMY")

if year is in format 09 instead of 2009, precede "Y" by the century, eg "DM20Y"

if dates span centuries (eg 1998 and 2009), give the latest possible year as a third argument, eg date(x,"DMY",2020)
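The two ways of handling two-digit years can be checked directly with display. The example dates are invented; with a third argument of 2020, each two-digit year is read as the latest matching year that does not exceed 2020.

```stata
* Two-digit years with a top year of 2020
display %td date("12/03/98", "DMY", 2020)     // 98 is read as 1998
display %td date("12/03/09", "DMY", 2020)     // 09 is read as 2009

* Two-digit years with a fixed century in the mask
display %td date("12/03/09", "DM20Y")         // 09 is read as 2009
```

A fixed century such as "DM20Y" would turn 98 into 2098, so the top-year form is the safer choice when the data span two centuries.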

 

generate birthday = mdy(month,day,year)

generate m = month(birthday)

generate d = day(birthday)

generate y = year(birthday)

dow(x)     day of week (0 = Sunday)

generate weeks = diff/7     (diff = difference between two %td dates, in days)

generate months = diff/30.5

generate years = diff/365.25

 

 

Other functions:

weekly(x,"wy")

monthly(x,"my")

quarterly(x,"qy")

halfyearly(x,"hy")

yearly(x,"y")

 

If three columns for each day, month and year use:

gen y = mdy(month,day,year)

gen x = mdy(x,y,z)

 

 

mdy(month,day,year) daily

yw(year, week) weekly

ym(year,month) monthly

yq(year,quarter) quarterly 

yh(year,half-year) half-yearly 

 

Translating to %td dates (DD/MM/YYYY)

dofw() weekly to daily

dofm() monthly to daily

dofq() quarterly to daily

dofy() yearly to daily

 

Translate from %td dates:

 

wofd() daily to weekly


mofd() daily to monthly

qofd() daily to quarterly

yofd() daily to yearly

 

To reference dates (wk, qtr and yr stand for your own weekly, quarterly and yearly date variables):

reg x y if wk == tw(1999w10)

sum salary if qtr == tq(1998q4)

tab sex if yr == 2007

To reference range of dates use the tin() and twithin() functions (data must be tsset):

reg y x if tin(01feb1998,01jun1998)

sum income if twithin(1990q1,1990q3)

tin() includes the beginning and end dates, twithin() excludes them

Stata: Visual date   displays

%tc | mdyhms(M, D, Y, h, m, s)

%tc | dhms(td, h, m, s)

%tc | hms(h, m, s)

%td | mdy(M, D, Y)

%tw | yw(Y, W)

%tm | ym(Y, M)

%tq | yq(Y, Q)

%th | yh(Y, H)

%ty | Y

clock values (%tc) for data in format "12/03/2009 23:34"

gen h = hh(x)     extracts the hour from a clock value; mm(x) and ss(x) give minutes and seconds

gen x = mdy(m,d,y)     or mdyhms(M, D, Y, h, m, s) for a clock value

gen bdayday = day(bdaynew)

gen bdaydow = dow(bdaynew)

gen bdaymo = month(bdaynew)

gen bdayyr = year(bdaynew)

Convert date times  (NOTE %tc is milliseconds, %td is days, %tm is months)

%tc to %td use dofc(x) ie from "12/03/2009 23:34" to "12/03/2009"

%td to %tm use mofd( )

%td to %tq use qofd( )

then can apply year( ) month( ) day( ) doy( ) dow( ) halfyear( ) quarter( ) week( )

doy( ) is day of year from 1-366; dow( ) is day of week with 0 = Sunday

Conditional date arguments

gen before = cond(adm<td(15jun2006),1,0)

list if !inrange(x,2,10)     lists if not in range 2 to 10

list age if inrange(population,200,5000)

gen byte flag = inlist(x,"one","two")     1 if string x is "one" or "two", 0 otherwise

egen n = rcount(v1-v4), cond(@>5 & @<15)     counts how many of v1-v4 are above 5 and below 15 (egenmore function)

by year, sort: egen y = sum(died)

generate bdaynew = date(bday,"MDY",2010)     if data as 02/03/07

Date script

TYPICAL DATE SCRIPT (assume date in string format as dd/mm/yyyy hh:mm:ss) and you want %td format

 gen x =substr(date,1,10)  //converts string date to dd/mm/yyyy string

drop date


gen date= date(x,"DMY")  //converts date to %td format

format date %td

label variable date "Date"

gen yr = year(date)

gen mo = month(date)

gen day=day(date)

gen month_yr =mofd(date) //convert to month yr format

format month_yr %tm

FOR FULL CLOCK FORMAT use (assume date in string format as dd/mm/yyyy hh:mm:ss)

gen double date2 = clock(date,"DMYhms")     // clock values must be stored as double

format date2 %tc

Graph tips

Histograms

histogram x, by(group, total) percent bin(10)

histogram x, frequency title("Graph1") xlabel(15(10)30) ytick(1(2)10) start(10) width(2) normal gap(15)

  ........  start is where the first bar begins, width is bin size, normal overlays a normal curve, gap is the space between bars

Scatterplots

graph twoway scatter x y     (can use xlabel, xtick, xtitle and also msymbol)

msymbol() can be: O, o, D, d, T, t, S, s, +, smplus, X, x (add "h" for hollow eg Oh = hollow big circle), p = point, i = none

twoway scatter x y [fweight = age], by(group) msymbol(Oh) mlabel(id)     allows bubble plot with size for age

to format axis numbers eg 1.33 to 1.3 use: ylabel(,format(%3.1f))

Lineplots

graph twoway line x y year, legend(label(1 "label a") label(2 "label b") position(2) ring(0) rows(2)) ytitle(" ")

this plots lines x and y against time (year); the legend is placed inside the graph (ring(0)) in the top right corner with 2 rows

can use xtick, eg xtick(1960(2)1980) for every 2 yrs

ylabel(0(10)50, angle(horizontal))     plots labels for y axis horizontally

clpattern is type of line and can be: solid, dash, dot, dash_dot, shortdash, shortdash_dot, longdash, blank

if two lines then specify each in the plot eg msymbol(T Oh) clpattern(dash solid)

Barplots

Can do summaries eg graph bar (median) x, over(group) blabel(bar, size(medium)) bar(1, bcolor(gs10)) bar(2, bcolor(gs7))

note: blabel puts the value on top of each bar

bar labels can vary in size: size(small) or tiny medsmall medlarge large

Stacked bar graph: graph bar (sum) x y z, over(group) stack

graph hbar for horizontal bar graph

Other graphs:

qqplot x y

quantile x

qnorm x,grid

Mean/SEM type graph:

graph twoway rcap xlow xhigh year || connected z_mean year, legend(off)     if z_mean is the mean and the 95% CI limits are in the dataset as xlow and xhigh respectively (eg after collapse command)

Can use the anova command to create mean/SEM graphs by using the "predict" command after anova:

anova income year

predict income_mean     (generates mean value for income)

predict SEincome, stdp     (generates standard error for income)

then use serrbar with scale(2) to plot income_mean +/- 2 x SEincome

serrbar income_mean SEincome year, scale(2) addplot(line income_mean year, clpattern(solid)) legend(off)

or do following:  **from Statistics with Stata by Hamilton

anova income year gender*year

predict aggmean

predict SEagg, stdp

gen agghigh = aggmean + 2 * SEagg

gen agglow = aggmean - 2 * SEagg

graph twoway connected aggmean year || rcap agghigh agglow year, by(gender, legend(off) note(" ")) ytitle(" xxx")

Overlapping graphs

graph twoway lfitci x y || scatter x y, xlabel(10(2)20) ylabel(2(10)20, angle(horizontal)) legend(order(2 1) label(1 "95% CI") label(2 "regression line") rows(3) position(1) ring(0))

Combining graphs

graph twoway x y ..............., saving(fig1)

graph twoway z x ..............., saving(fig2)

graph combine fig1.gph fig2.gph, imargin(vsmall) rows(2)     rows is number of rows in the combined graph

Graph time-series

tsset date_x, format(%td)     sets data as daily time series where date_x is the date variable

tssmooth ma newvar = x, window(2 1 2)     generates a 5 day moving average (2 lagged, current value and 2 leading)

graph twoway line admissions date_x      plots line graph of admissions vs date but x-axis looks too busy so use

graph twoway tsline admissions, ylabel(10(10)100) ttitle(" ") tlabel(01jan1983 01mar1983, grid) ttick(01feb1983 01apr1983 01jun1983) clwidth(thin) clpattern(solid)

using tsline, because data is tsset, one doesn't reference time; suppress the title with ttitle("")

Other way to generate moving average is egen:

egen moving_av = ma(x), nomiss t(3)     gives 3 day moving average if data is tsset to daily format

tssmooth nl command is better if outliers present

tssmooth nl x_smooth = x, smoother(4253h, twice)     compound smoother: running medians of span 4, 2, 5 and 3, then Hanning (a 1/4, 1/2, 1/4 weighted moving average of span 3), with the residuals smoothed again and added back ("twice"), according to Velleman

4/01/2010 19:22:24

Date: convert date to clock value. Must be in format 12/03/09 12:34

gen double addate1 = clock(doa,"DMY hm")     // for PICU data where date field is 12/03/2008 12:33

format addate1 %tc

*generate date format ie DDMMYYYY from clock value above

gen addate = dofc(addate1)

format addate %td

02/02/2010 19:46:28 String to time conversion. http://www.sealedenvelope.com/stata/time.php

str2time tod, generate(etod)

eg: tod (string) 03:17 = etod (double) 0.13680556

02/02/2010 19:48:03

Convert decimal time to HH:MM format (24hrs = 1): converts a numeric variable containing elapsed times to a string variable containing times in 24 hour clock format (HH:MM or HH:MM:SS). http://www.sealedenvelope.com/stata/time.php

time2str etod, generate(tod)

04/01/2010 21:17:08

stats on multiple tables of variables by group. Can be count data or any other statistic

table x y z, contents(freq) by(group)

table x y z, contents(mean var1) by(group)

table x y z, contents(median var1 mean var2) by(group)

24/01/2010 19:19:51

_n manipulations: sequential count of observations in subgroups (eg count from observation 1,2,3,4 ... for every year, with each year starting at 1 again)

by yr, sort: gen obs2 = _n     gives obs count per subgroup

by yr, sort: gen obs3 = _N     gives max count per subgroup
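A quick sketch of _n and _N within groups, on invented data (yr, x, obs_n and obs_N are example names). Putting x in brackets after the by-variable sorts within each year without making x part of the group.

```stata
* Hypothetical example data: a few records per year
clear
input yr x
2008 10
2008 12
2009 7
2009 9
2009 11
end

bysort yr (x): gen obs_n = _n     // running count 1,2,3... restarting in each year
by yr: gen obs_N = _N             // total number of records in each year
list, sepby(yr)
```

With these two variables, obs_n == 1 picks the first record per year and obs_n == obs_N picks the last.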

24/01/2010 19:25:04 Reshape data between wide and long formats

i = identifier, j = subgroup, stub = value

wide: divides dataset into stubs of j (subgroup)

eg: i=patient, j=time, stub=Na

reshape wide Na, i(patient) j(time)     will set data in format to use in anova

eg patient with Na1 Na2 Na3 (time 1 2 3) as variables

to move between the above example datasets use:

reshape long x, i(id) j(year)

reshape wide x, i(id) j(year)

These steps “undo” each other.
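A round trip on invented sodium (Na) readings shows the two layouts. The values are made up; only patient, time and Na follow the example in the notes.

```stata
* Hypothetical example data in long format: one row per patient per time point
clear
input patient time Na
1 1 138
1 2 140
1 3 141
2 1 135
2 2 137
2 3 139
end

reshape wide Na, i(patient) j(time)     // one row per patient: Na1 Na2 Na3
list

reshape long Na, i(patient) j(time)     // back to one row per patient per time
list
```

After the second command the dataset again has six rows, one for each patient and time point.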

24/01/2010 19:26:57 Start up do file

make file "profile.do" for startup

put it in root of Stata directory

In this do-file put all basic start up commands

02/02/2010 19:43:21 Web page http://www.sealedenvelope.com/stata.php

hl     Hosmer-Lemeshow goodness of fit test

reformat     convert regression output into near publication quality tables

time     utilities to translate strings in 24hr clock HH:MM format to elapsed times and back again

xcount     count clusters in longitudinal data

xfill     fill in static variables

xtab     tabulate longitudinal data at the cluster level

02/02/2010 20:01:20

web http://www.cpc.unc.edu/research/tools/data_analysis/statatutorial/index.html University of North Carolina Stata pages

24/01/2010 20:16:13

Plot ANOVA graph

anova v1 gender year gender*year

predict vmean

label variable vmean "mean v"

predict SEv, stdp     //predict calculates standard error of predicted means

gen vhigh = vmean + 2 * SEv

gen vlow = vmean - 2 * SEv

graph twoway connected vmean year || rcap vhigh vlow year, by(gender, legend(off) note(" ")) ytitle("mean v")

24/01/2010 20:16:45 Plot ANOVA graphs with error bars

anova drink year

predict drinkmean     // drinkmean is a new variable

label variable drinkmean "Mean drinking scale"

predict SEdrink, stdp     //standard error of predicted means

serrbar drinkmean SEdrink year, scale(2) addplot(line drinkmean year, clpattern(solid)) legend(off)

24/01/2010 20:18:37

Plot ANOVA graphs across time

xtline info: http://www.ats.ucla.edu/stat/stata/faq/xtline.htm

if data in long format use xtline to plot time series of y (id time y as variables)

http://www.ats.ucla.edu/stat/stata/faq/visualize_longitudinal.htm

xtline y, t(time) i(id) overlay

if data in wide form (each time point has separate variable eg id time1 time2 time3 time4)

profileplot time1 time2 time3 time4, by(id)

profileplot info: http://www.ats.ucla.edu/stat/stata/faq/profileplot.htm

24/01/2010 20:20:05 Bar graphs by Group

graph bar (sum) x y z

graph bar (sum) x, over(y)

graph bar (sum) x, over(y) stack

Can use mean, median, percent, count instead of sum (or any statistic as in the "collapse" command). Can add reference lines with yline(5)

graph bar (median) x, over(y) blabel(bar) bar(1, bcolor(gs10))

blabel puts numbers on top of bar, bcolor is greyscale; can size up label: blabel(bar, size(medium))

For second bar add to end: bar(2, bcolor(gs7)) and at start: graph bar x y,

24/01/2010 20:21:25 CATPLOT, BEAMPLOT graphs

beamplot x, by(group1) over(group2)

catplot bar rep78, by(foreign) percent(foreign)

catplot hbar rep78, by(foreign) percent(foreign)

24/01/2010 20:22:00 Boxplot

graph hbox x, over(group, sort(1)) yline(5) intensity(30) marker(1, mlabel(z) mlabpos(12))

intensity() sets the box shading; reference lines in hbox are still given with yline()

26/01/2010 16:50:04

How to "collapse" data and then plot summary values:

collapse (mean) dep (sd) sddep=dep (count) n=dep, by(visit group)

error bar plot:

sort group

gen high = dep + 2*sddep/sqrt(n)

gen low = dep - 2*sddep/sqrt(n)

twoway (rarea low high visit, bfcolor(gs12)) (connected dep visit, mcolor(black) clcolor(black)), by(group) legend(order(1 "95% CI" 2 "mean depression"))

26/01/2010 17:01:01 How to refer to scalar in a graph title (CUSUM Graph. Comp=`=scalar(comp)')

26/01/2010 17:05:10 How to show r2 regression result in title of regression graph

ereturn list shows the r2 value, stored as e(r2)

reg x y

local r2: display %5.4f e(r2)

twoway (lfitci x y) (scatter x y), note(R-squared=`r2')
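The same idea on the bundled auto dataset, as a sketch with mpg and weight standing in for the two variables. The local macro r2 holds the formatted number as text, so it can be dropped into any title or note.

```stata
* Example on Stata's bundled auto dataset
sysuse auto, clear
regress mpg weight
local r2 : display %5.4f e(r2)          // format e(r2) to 4 decimal places
twoway (lfitci mpg weight) (scatter mpg weight), note("R-squared = `r2'")
```

Local macros only live for the current do-file or interactive session, so define r2 in the same run as the graph command.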

26/01/2010 17:08:32 Histograms

histogram x, frequency title("xxx") xlabel(13(2)34) ylabel(0(2)10) ytick(1(2)4) start(10) width(2)

options:

bin(3)

percent

gap(5)

addlabel     adds values on top of bars

discrete     one bar for each value of x

normal     overlays normal curve

26/01/2010 17:09:09 Histogram by group

histogram x, by(group) percent bin(5)

if histogram of totals also needed use by(group, total)

26/01/2010 17:09:56 Histogram and boxplot on same graph. Install histbox.ado

02/02/2010 20:12:02 Add linear regression line to scatterplot

12/02/2010 16:07:03 graph descriptive data with one command: sixplot module

21/02/2010 08:45:17

ERROR BAR GRAPH: Draw graph with time on x-axis and mean/SE error bars or median and IQR bars (eg blood pressure between groups over time). Need to xtset data to panel format. Needs module XTGRAPH. USEFUL

21/02/2010 10:28:39 Stacked bar graph

26/01/2010 17:11:07 Fill in missing gaps in variable in bulk

24/01/2010 19:30:11 Statistics by group (eg median, means, IQR)

26/01/2010 16:48:30 Decode all missing values to a fixed number (eg 999)

26/01/2010 16:53:22 Generate random numbers and mix up the numbers randomly after this

26/01/2010 16:55:59 Removing duplicated observations in an entire file

26/01/2010 17:00:18 Counting by groups

02/02/2010 19:52:45

Count cluster data. XCOUNT: net from http://www.sealedenvelope.com/

02/02/2010 19:59:04 tab at cluster level (for panel data)

12/02/2010 16:24:34 subtract the previous value in a running calculation

12/02/2010 16:27:40

bysort command for grouped stats on a variable. Does not need sorting before the command is run

12/02/2010 16:30:06 Sorting orders within groups (_N _n values)

12/02/2010 16:40:40 table of means/SD/count across groups: use oneway test

12/02/2010 16:51:36 Calculate difference in 1 value from next or sum of next value

12/02/2010 17:02:43 Counting variables by group

13/02/2010 16:09:00

Tabulate ranges of each quartile for a variable using "xtile" to split data in quartiles, and tabstat to create grouped stats by quartiles. VERY USEFUL

13/02/2010 16:26:08

Create a list with descriptive stats (mean, n, median etc) for a list of items in a group using EGEN command. Usable stats include: mean, n, iqr, kurt, skew, sd, mdev, median, pc (percent or proportion), p(n) which is percentile, rank. USEFUL

13/02/2010 16:27:28 Setting highest value (record value) in a variable

12/02/2010 16:03:16 web: new modules from

24/01/2010 19:56:44 Goodness of fit

24/01/2010 19:57:10 T tests

24/01/2010 19:58:06 Mann-Whitney, Kruskal-Wallis

26/01/2010 16:47:33 Adding formatting to decimal places after the tab command for basic stats

26/01/2010 16:51:22 How to tabulate summary values (eg median) by group

26/01/2010 16:52:37

Odds ratio with Fisher's exact test, knowing the numbers per group, eg: 21 and 16 vs 1 and 4

26/01/2010 17:11:49 Diagnostic tests (specificity and sensitivity and predictive values)

12/02/2010 16:48:53 Chi squared with known proportions or Fisher's exact

13/02/2010 16:04:55 Grouped table of stats in multiple rows using tabstat (medians, means etc) VERY USEFUL

16/02/2010 21:28:59

Generate parameter data (intercept, slope, standard errors for both) for clustered longitudinal data (cluster = id) using linear regression (outcome variable = sofa; time variable = day). Generates parameters for each cluster (id).

19/02/2010 13:09:05

XCONTRACT: great way to generate count/percentage, cumulative percentage stats on grouped data (like COLLAPSE function but without overwriting tables). Need to download module first from SSC

04/01/2010 21:12:56 Convert string to numeric

23/01/2010 23:11:48 cumulative count by variable (_n)

24/01/2010 10:29:00 Make dummy variables from a categorical variable, eg variable "size" has size = 0, size = 1, size = 2. Useful in regression

24/01/2010 19:18:32 List and define missing variables: use "codebook"

24/01/2010 19:23:30

Conditional argument for string: if variable is string eg "alive" or "dead" then convert to 1,0

24/01/2010 19:54:37 Tabstat (for statistics by groups)

26/01/2010 16:57:06 Split characters off a string in different positions

26/01/2010 16:59:27 Split the first word of a string before a delimiter eg. "John:Smith" to John where delimiter is ":". Delimiter can be blank space, if so use (,)

26/01/2010 17:03:55 Summarise missing values for variables

02/02/2010 20:16:06 One to one merging (if each uniqueid has a matched uniqueid in another file

03/02/2010 14:56:41 Conditional arguments for strings

12/02/2010 15:57:11 Saving log and commands panel

12/02/2010 16:10:22 Replace contents of a string. FDTA module: fdta checks the string variables, searching for str1. Whenever a matching str1 is found, it is replaced with str2.

12/02/2010 16:36:06 Coding variables into categorical groups

12/02/2010 16:38:51 Quick stats (T tests, Mann whitney etc)

12/02/2010 16:43:55 Recode continuous variable into defined categories with cut command

12/02/2010 16:44:32 cumulative sum

12/02/2010 16:46:45 replace variable with previous value if a value is missing

12/02/2010 16:56:01 Missing value in if command

12/02/2010 16:58:32 Find variables in list

12/02/2010 17:00:08 recode variable

12/02/2010 17:01:02 cumulative sum by group
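A short sketch of a running total within groups, assuming the data should be accumulated in day order within each id. The variable names id, day, x and cumx and all values are invented for the example.

```stata
* Toy data (invented)
clear
input id day x
1 1 2
1 2 3
1 3 5
2 1 10
2 2 1
end

* sum() in generate is a running sum; bysort restarts it for each id
* and the (day) part fixes the order within id
bysort id (day): gen cumx = sum(x)
list, sepby(id)
```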

12/02/2010 17:07:03 number of distinct observations by group

13/02/2010 16:30:30 Separating lines in list

13/02/2010 16:33:39 List of TOTAL values by a group using EGEN command. USEFUL

13/02/2010 16:35:46 Current dates and times

21/02/2010 09:56:41 Generate a numerical sequence in an empty variable eg. 0,5,10...

21/02/2010 09:58:56 Convert a string to a number

21/02/2010 14:59:00 MISSING VALUES: Generate a variable to define if a list of variables have missing values

25/02/2010 11:17:15 GROUP COMMAND: Group string or continuous data into distinct groups (integers 1, 2, 3 etc). 1st distinct value coded as 1, next as 2 and so on.

25/02/2010 11:19:23 Number lists examples

25/02/2010 15:37:55 REPLACE MISSING value with the preceding value
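A minimal sketch of carrying the last observed value forward, assuming the fill should stay within each id and follow day order. Variable names and values are invented for the example.

```stata
* Toy data (invented): "." is a missing value
clear
input id day x
1 1 5
1 2 .
1 3 .
1 4 7
2 1 .
2 2 3
2 3 .
end

* replace works down the data one observation at a time, so a run of
* missing values is filled from the last non-missing value above it
bysort id (day): replace x = x[_n-1] if missing(x)
list, sepby(id)
```

A missing value in the first observation of an id has nothing before it and stays missing.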

25/02/2010 15:47:03 LAG value by previous or next value

25/02/2010 15:55:18 RECODE

25/02/2010 17:17:04 Referencing SCALARS in variables or in graphs titles

27/02/2010 23:11:58 DIAGNOSTIC TESTS: specificity, sensitivity, predictive values

28/02/2010 23:45:33 Bulk RENAMING VARIABLES: renvars module

03/03/2010 20:38:31 STATSBY: produce tables of descriptive stats. Rewrites data into new dta tables

04/03/2010 21:49:11 Fill a sequence (use egen)

07/03/2010 22:29:31 Convert TIME in HH:MM (eg 12:33) to minutes
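One possible approach, sketched with string functions only: find the colon, then convert the text on either side to numbers. The variable names hhmm, colon and minutes are assumptions made for the example.

```stata
* Toy data (invented): times stored as text in HH:MM form
clear
input str5 hhmm
"12:33"
"00:05"
"23:59"
end

* Position of the colon, then hours*60 + minutes
gen colon = strpos(hhmm, ":")
gen minutes = 60*real(substr(hhmm, 1, colon - 1)) + real(substr(hhmm, colon + 1, .))
drop colon
list
```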

13/04/2010 16:16:25 Count unique distinct values in a string where multiple duplicates may occur (e.g. each patient has a unique id even if admitted many times)
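A sketch of one way to count distinct ids with built-in commands, assuming a string identifier called patientid (a hypothetical name; the data are invented).

```stata
* Toy data (invented): three patients, six admissions
clear
input str3 patientid admission
"A01" 1
"A01" 2
"B02" 1
"C03" 1
"C03" 2
"C03" 3
end

* tag() marks exactly one observation per distinct patientid with 1, the rest 0
egen first = tag(patientid)

* Counting the tagged observations gives the number of distinct patients
count if first
display r(N)
```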

15/04/2010 15:12:36 Create variable containing the median length of stay for each diagnostic code.
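A minimal sketch using egen, assuming length of stay is held in los and the diagnostic code in diagcode (both names and all values are invented for the example).

```stata
* Toy data (invented)
clear
input str4 diagcode los
"J21" 2
"J21" 4
"J21" 9
"A41" 10
"A41" 20
end

* Every observation receives the median los of its own diagnostic code
egen med_los = median(los), by(diagcode)
list, sepby(diagcode)
```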

15/04/2010 17:19:29 Stats for multiple groups using SORT

21/04/2010 21:49:19 Profile.do template file

21/04/2010 22:01:32 Moving average

21/04/2010 22:21:17 Breaking a categorical variable into a set of binary variables (make each categorical value a separate new variable)

21/04/2010 22:23:36 ERROR BAR AREA GRAPH after collapsing data to basic stats

collapse (mean) dep (sd) sddep=dep (count) n=dep, by(visit group)

Error bar plot of this:

sort group
gen high = dep + 2*sddep/sqrt(n)
gen low = dep - 2*sddep/sqrt(n)
twoway (rarea low high visit, bfcolor(gs12)) ///
    (connected dep visit, mcolor(black) clcolor(black)), by(group)

29/04/2010 13:24:14 GROUP command to combine a number of variables into a unique group

egen newid = group(var1 var2)
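A small worked example of what group() returns for two variables. The names var1, var2 and newid follow the note; the data are invented. Group numbers follow the sorted order of the value combinations, not their order of appearance in the data.

```stata
* Toy data (invented): one string and one numeric variable
clear
input str1 var1 var2
"a" 1
"a" 2
"b" 1
"a" 1
end

* Each distinct (var1, var2) combination gets its own integer
egen newid = group(var1 var2)
list
```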

29/04/2010 13:25:46 Grouping code

egen faminc = total(income), by(family)
egen maxinc = max(income), by(family)

03/05/2010 22:50:13 Pyramid type plot for day vs night shifts. Shows how to change graph axis label size

twoway (line beds_AM day) (line beds_PM day), by(month) ///
    ylabel(0(5)30, angle(horizontal) valuelabel labsize(*.8)) ///
    xline(15) xline(0) xline(-15) ///
    xtitle("Max PICU beds 2009") ytitle("") ///
    xlabel(-25 "25" -20 "20" -15 "15" -10 "10" -5 "5" 0 5 10 15 20 25, valuelabel labsize(*.8)) ///
    legend(label(1 Day) label(2 Night))

19/05/2010 15:06:29 _N count for only the maximal value of a grouped variable list (e.g. making a variable with maximal value per year per patient)

bysort patientid: gen x = _n
egen y = max(x), by(patientid)
gen var1 = yr if x == y

15/06/2010 19:14:49 Scatterplot with regression line and r2 coefficient (SPARL module)

sparl x y
regress

08/07/2010 00:18:17 eperiod module: TIME DIFFERENCE, calculate difference between dates (years, months, days)

Centering data

Return lists:

Following the summarize command, the following returned results are available: r(N), r(sum), r(mean), r(sd), r(min), r(max)
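As a quick illustration, the returned results can be inspected with return list and used in later expressions. The variable age and its values are invented for the example.

```stata
* Toy data (invented)
clear
input age
20
30
40
end

summarize age

* Show everything summarize left behind in r()
return list

* Use individual results in an expression
display "mean = " r(mean) "  sd = " r(sd)
```

The r() results are overwritten by the next command that returns results, so they should be used or copied into a local or scalar straight away.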

To standardise values:

summarize age
gen age_c = (age - r(mean))/r(sd)

Foreach loop to centre data: in this example the variables los and los1 are centered, creating los_c and los1_c

foreach var of varlist los los1 {
    summarize `var', meanonly
    generate `var'_c = `var' - r(mean)
    label variable `var'_c "`var' (centered)"
}

Loops

Basic loops that can be modified to perform many tasks

foreach variable in v1 v2 v3 {
    list `variable'
}

foreach number of numlist 1 2 3 {
    display `number'
}

forvalues i = 1/3 {
    display `i'
}

The same output can also be produced using while as follows:

local i = 1
while `i' <= 3 {
    display `i'
    local i = `i' + 1
}

Looping through a variable list

local xlist na k cl
foreach v of varlist `xlist' {
    summ `v'
    correlate sbe `v'
    scatter sbe `v'
}

Looping through a variable for duplicates

foreach var of varlist dupl_tag {
    sort dupl day h_hour h_ph sample
    by dupl: gen x = h_na[_n-1] - h_na
    drop if x<0.05 & dupl_tag == `var'
    drop x
}