STATA Training Session 3


Page 1: STATA Training Session 3

STATA Training Session 3

Advanced Topics in STATA

Sun Li

Centre for Academic Computing

[email protected]

Page 2: STATA Training Session 3

Outline

• Resources And Books
• Survival Analysis
  • Kaplan-Meier Estimator
  • Cox Regression
• Time Series and Forecasting
  • Exponential Smoothing
  • ARIMA Models
• Introduction to Panel Regression with STATA

Page 3: STATA Training Session 3

Resources And Books

CAC Computing Resources for STATA users

• Windows:
  • STATA/SE version 10.0
  • 10-user network perpetual license
  • Installation guide (http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA-Software Questions.aspx)
• Linux – CAC Beowulf Cluster:
  • STATA/SE version 10.0
  • Unlimited users
  • About CAC Beowulf Cluster (http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)
• New features in STATA 10.0 (http://www.stata.com/stata10)

Page 4: STATA Training Session 3

Resources And Books

• Website resources:
  • The STATA website: http://www.stata.com
  • The STATA Journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/
  • STATA FAQ: http://www.stata.com/support/faqs
  • STATA User Support: http://www.stata.com/support
  • Books: http://www.stata.com/bookstore/
• CAC STATA support:
  • Website: http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
  • Contact:
    • For statistical consultation: Sun Li: [email protected]
    • For software installation: TAN Suh Wen: [email protected]

Page 5: STATA Training Session 3

Resources And Books

• Additional recommended readings:
  • Econometric Analysis of Cross Section and Panel Data, Jeffrey M. Wooldridge
  • An Introduction to Modern Econometrics Using Stata, Christopher F. Baum
  • New Introduction to Multiple Time Series Analysis, Helmut Lütkepohl
  • Applied Survival Analysis: Regression Modeling of Time to Event Data, 2nd Edition, David W. Hosmer, Jr., Stanley Lemeshow, and Susanne May
  • An Introduction to Survival Analysis Using Stata, Revised Edition, Mario Cleves, William W. Gould, and Roberto G. Gutierrez

Page 6: STATA Training Session 3

Download training slides, data and syntax:

http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx

Page 7: STATA Training Session 3

Survival Analysis

Survival Data

• Data: survival data are time-to-event data: quantitative data measuring the time from a well-defined time origin until the occurrence of a particular event of interest (the endpoint).
• Reasons for using a survival model:
  • The distribution of survival data tends to be positively skewed and is unlikely to be normal, and it may not be possible to find a transformation that makes it normal.
  • Ordinary regression methods cannot handle time-varying covariates.
  • In addition, some durations are censored.
• Censored observations: the event has not occurred by the end of the study; the subject is lost to follow-up or withdraws from the study; other interventions are offered; the event occurred but from an unrelated cause; etc.

Page 8: STATA Training Session 3

Survival Analysis

Survival Model

Survival function: $S(t) = P(T \ge t) = 1 - F(t)$

Hazard function: $h(t) = \dfrac{f(t)}{S(t)} \;\Rightarrow\; -\dfrac{d\,\log(S(t))}{dt} = h(t)$

$S(t) = \exp(-H(t))$, where $H(t)$ is the cumulative hazard function.
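As a quick check of how these relations fit together, consider the textbook special case of a constant hazard $\lambda$. This is an illustrative assumption, not something estimated from the course data, and it uses the standard definition of the cumulative hazard as the integral of $h$:

```latex
h(t) = \lambda
\;\Rightarrow\;
H(t) = \int_0^t h(u)\,du = \lambda t
\;\Rightarrow\;
S(t) = \exp(-H(t)) = e^{-\lambda t},
\qquad
f(t) = h(t)\,S(t) = \lambda e^{-\lambda t}
```

So a constant hazard corresponds to exponentially distributed survival times.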

Page 9: STATA Training Session 3

Survival Analysis

Kaplan-Meier Estimator:

$\hat{S}(t) = \prod_{j \,|\, t_{(j)} \le t} \left(1 - \dfrac{d_j}{n_j}\right)$

$d_j$: the number of individuals who experience the event at time $t_{(j)}$

$n_j$: the number of individuals who have not yet experienced the event at time $t_{(j)}$

$t_{(1)} < t_{(2)} < \dots < t_{(n)}$

Cox Regression:

$h_i(t) = h_0(t)\exp(\beta^T x_i) \;\Rightarrow\; S_i(t) = S_0(t)^{\exp(\beta^T x_i)} \;\Rightarrow\; \log(H_i(t)) = \log(H_0(t)) + \beta^T x_i$

$h_0(t)$ is the baseline hazard function.

$\exp(\beta^T (x_i - x_j))$ is the hazard ratio (HR) or incidence rate ratio.
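To make the Kaplan-Meier product concrete, here is a minimal sketch on a made-up five-subject dataset. The variable names time, event and km and all the values are invented for illustration; they are not from the course data. It lets sts compute the estimate and shows the hand calculation alongside.

```stata
* Toy data: 5 subjects; event = 1 is a failure, 0 is censored (values are made up)
clear
input time event
2 1
3 1
3 0
5 1
6 0
end

* Declare the data as survival-time data
stset time, failure(event)

* Kaplan-Meier table, and the estimate stored as a variable
sts list
sts generate km = s

* Hand calculation with the product formula:
*   t = 2: n = 5, d = 1  ->  S = 1 - 1/5          = 0.8
*   t = 3: n = 4, d = 1  ->  S = 0.8 * (1 - 1/4)  = 0.6
*   t = 5: n = 2, d = 1  ->  S = 0.6 * (1 - 1/2)  = 0.3
list time event km
```

The censored subject at t = 3 still counts in the risk set at t = 3 and only drops out afterwards, which is why n falls from 4 to 2 by t = 5.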

Page 10: STATA Training Session 3

Survival Analysis

Survival Analysis in STATA

Data: telco.csv

Variable name   Variable information
age             Age in years
marital         Marital status: 0 = unmarried, 1 = married
address         Years in current address
income          Household income in thousands
ed              Level of education: 1 = didn’t complete high school, 2 = high school degree, 3 = college degree, 4 = undergraduate, 5 = postgraduate
employ          Years with current employer
reside          Number of people in household
gender          Gender: 0 = male, 1 = female
tenure          Months with service
churn           Churn within last month: 0 = ‘No’, 1 = ‘Yes’
custcat         Customer category: 1 = basic service, 2 = E-service, 3 = plus service, 4 = total service

Page 11: STATA Training Session 3

Survival Analysis

Declaring and summarizing survival-time data:

insheet using telco.csv

d

stset tenure, failure(churn)

stset creates the following variables:

_st: 1 if the record is to be used, 0 if ignored;

_d: 1 if failure, 0 if censored;

_t: analysis time when record ends;

_t0: analysis time when record begins.

Page 12: STATA Training Session 3

Survival Analysis

stsum

ltable tenure churn

sts graph, by(custcat)

[Figure: Kaplan-Meier survival estimates by custcat (custcat = 1 to 4); x-axis: analysis time (0 to 80); y-axis: 0.00 to 1.00]

Page 13: STATA Training Session 3

Survival Analysis

Fitting regression models:

sw stcox age marital address income ed employ retire, pe(0.05)

xi : stcox employ address marital income i.custcat

test _Icustcat_2 _Icustcat_3 _Icustcat_4

Page 14: STATA Training Session 3

Survival Analysis

char marital[omit] 1

char custcat[omit] 4

xi:stcox employ address i.marital income i.custcat, basesurv(s) basehc(h)

Page 15: STATA Training Session 3

Survival Analysis

stcurve, survival

stcurve, hazard

stcurve, survival at1(_Icustcat_1=1) at2(_Icustcat_2=1) at3(_Icustcat_3=1)

stcurve, hazard at1(_Icustcat_1=1) at2(_Icustcat_2=1) at3(_Icustcat_3=1)

[Figure: Cox proportional hazards regression, estimated survival curve; x-axis: analysis time (0 to 80); y-axis: Survival (.5 to 1)]

[Figure: Cox proportional hazards regression, estimated survival curves for _Icustcat_1=1, _Icustcat_2=1 and _Icustcat_3=1; x-axis: analysis time (0 to 80); y-axis: Survival (.5 to 1)]

Page 16: STATA Training Session 3

Survival Analysis

Examining the proportional hazards assumption:

stphplot, by(custcat)

[Figure: -ln[-ln(Survival Probability)] against ln(analysis time), one curve per customer category (custcat = 1 to 4)]

The proportional-hazards assumption is not violated when the curves are parallel.

Page 17: STATA Training Session 3

Survival Analysis

Examining time-varying covariates:

xi : stcox employ address i.marital income i.custcat, tvc(employ)

estimates store model1

xi : quietly stcox employ address i.marital income i.custcat

lrtest model1 .

Page 18: STATA Training Session 3

Survival Analysis

Exercise 1

Repeat the above analysis, treating customer category as a stratifying variable instead of a covariate.
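One possible starting point for Exercise 1, offered as a hedged sketch rather than the official solution: stcox accepts a strata() option, which gives each customer category its own baseline hazard instead of estimating coefficients for it. The covariate list is simply copied from the earlier slides, and the sketch assumes telco.csv from the course materials is in the working directory.

```stata
* Sketch for Exercise 1 (assumes telco.csv is in the working directory)
insheet using telco.csv, clear
stset tenure, failure(churn)

* Kaplan-Meier curves by customer category, as before
sts graph, by(custcat)

* Stratified Cox model: custcat is no longer a covariate;
* each stratum gets its own baseline hazard
xi: stcox employ address i.marital income, strata(custcat)
```

Because custcat enters only through the strata, the output contains no hazard ratios for the customer categories; the remaining coefficients are assumed common across strata.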

Page 19: STATA Training Session 3

Time Series Analysis & Forecasting

Definitions, Applications and Techniques

Time series data: each case represents a point in time, and each cell gives the value of a variable for that time period.
• Stationarity: a stationary process has the property that its mean, variance and autocorrelation structure do not change over time.
• Seasonality: by seasonality, we mean periodic fluctuations.

Time series models are used:
• to obtain an understanding of the underlying forces and structures that produce the observed data;
• to fit a model and proceed to forecasting and monitoring.

Techniques:
• Exponential Smoothing
• ARIMA Models

Page 20: STATA Training Session 3

Time Series Analysis & Forecasting

Exponential Smoothing

Four available model types:

• Simple. The simple model assumes that the series has no trend and no seasonal variation.
• Holt. The Holt model assumes that the series has a linear trend and no seasonal variation.
• Winters. The Winters model assumes that the series has a linear trend and multiplicative seasonal variation (its magnitude increases or decreases with the overall level of the series).
• Custom. A custom model allows you to specify the trend and seasonality components.

Page 21: STATA Training Session 3

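Time Series Analysis & Forecasting

General form of models:

• Single Exponential Smoothing:

$S_t = \alpha \sum_{i=1}^{t-2} (1-\alpha)^{i-1} y_{t-i} + (1-\alpha)^{t-2} S_2, \quad t \ge 2$

• Double Exponential Smoothing:

$S_t = \alpha y_t + (1-\alpha)(S_{t-1} + b_{t-1}), \quad 0 \le \alpha \le 1$

$b_t = \gamma (S_t - S_{t-1}) + (1-\gamma) b_{t-1}, \quad 0 \le \gamma \le 1$

• Triple Exponential Smoothing:

$S_t = \alpha \dfrac{y_t}{I_{t-L}} + (1-\alpha)(S_{t-1} + b_{t-1})$ (overall smoothing)

$b_t = \gamma (S_t - S_{t-1}) + (1-\gamma) b_{t-1}$ (trend smoothing)

$I_t = \beta \dfrac{y_t}{S_t} + (1-\beta) I_{t-L}$ (seasonal smoothing)

$F_{t+m} = (S_t + m b_t)\, I_{t-L+m}$ (forecast)

Here $I_t$ is the seasonal index (written $I_t$ to keep it distinct from the season length) and $L$ is the number of periods in a season.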
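The single exponential smoothing sum is the unrolled form of the recursion $S_t = \alpha y_{t-1} + (1-\alpha) S_{t-1}$ started at $S_2 = y_1$. A minimal sketch on made-up numbers follows; the series y, the variable s and the choice alpha = 0.5 are illustrative assumptions, not values from the course data.

```stata
* Toy series (values are made up)
clear
input t y
1 10
2 12
3 11
4 15
5 14
end

* Single exponential smoothing by hand with alpha = 0.5:
*   S_2 = y_1,  S_t = alpha*y_(t-1) + (1 - alpha)*S_(t-1)  for t >= 3
scalar alpha = 0.5
gen double s = y[_n-1] in 2
* replace works through the observations in order, so s[_n-1] is already updated
replace s = alpha*y[_n-1] + (1 - alpha)*s[_n-1] in 3/l

* S_3 = 0.5*12 + 0.5*10 = 11
* S_4 = 0.5*11 + 0.5*11 = 11
* S_5 = 0.5*15 + 0.5*11 = 13
list t y s
```

A larger alpha puts more weight on the most recent observation; a smaller alpha gives a smoother series that reacts more slowly.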

Page 22: STATA Training Session 3

Time Series Analysis & Forecasting

Example

Data: seasfac.csv

Variable name          Variable information
date                   Date
men                    Sales of men’s clothing
mail                   Number of catalogs mailed
page                   Number of pages in catalogs
phone                  Number of phone lines open for ordering
print                  Amount spent on print advertising
seasonal_factors_men   Seasonal factors for sales of men’s clothing
year_                  Year of the date
month_                 Month of the date

Page 23: STATA Training Session 3

Time Series Analysis & Forecasting

Step 1: Understand your data

insheet using seasfac.csv

d

gen mdate=ym(year_, month_)

scatter men mdate, c(l) sort xlabel(, grid) ylabel(, grid)

[Figure: men against mdate, 1989m1 to 1998m1; y-axis: 0 to 40000]

Page 24: STATA Training Session 3

Time Series Analysis & Forecasting

Step 2: Declare time series data & test stationarity assumption

tsset mdate, monthly

list men d.men l.men in 1/10

dfuller men, regress trend

Page 25: STATA Training Session 3

Time Series Analysis & Forecasting

Step 3: Detect seasonality with autocorrelations

reg d.men l.men

predict res1, r

corrgram res1

wntestq res1

Both the autocorrelation and partial autocorrelation charts show the annual seasonal structure of the time series.

Q-test for white noise: if the test is significant, the residuals are correlated.

Page 26: STATA Training Session 3

Time Series Analysis & Forecasting

Step 4: Construct Holt-Winters seasonal smoothing model

• Single-exponential smoothing

tssmooth exponential men1=men

• Holt-Winters nonseasonal smoothing

tssmooth hwinters men2=men, from(.1 .1) iterate(100)

• Holt-Winters seasonal smoothing

tssmooth shwinters men3=men, sn0_0(seasonal_factors_men) from(.1 .1 .1) iterate(100)

Page 27: STATA Training Session 3

Time Series Analysis & Forecasting

line men1 men2 men3 men mdate

[Figure: the three smoothed series and men against mdate, 1988m1 to 2000m1; legend: parms(0.1057) = men, hw parms(0.000 0.000) = men, shw parms(0.013 0.138 0.000) = men, men]

Page 28: STATA Training Session 3

Time Series Analysis & Forecasting

Step 5: Predictions

line men3 men mdate

line men3 men mdate if mdate>467

[Figure: shw parms(0.013 0.138 0.000) = men and men against mdate, 1988m1 to 2000m1]

[Figure: the same two series for mdate>467, 1999m1 to 2000m1]

Page 29: STATA Training Session 3

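Time Series Analysis & Forecasting

ARIMA Model – ARIMA(p, d, q)

• Autoregression (AR): p is the order of autoregression
• Integration (I): d is the order of integration (differencing)
• Moving-Average (MA): q is the order of moving-average

AR(p) model:

$X_t = \delta + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + A_t$

MA(q) model:

$X_t = \mu + A_t - \theta_1 A_{t-1} - \theta_2 A_{t-2} - \dots - \theta_q A_{t-q}$

ARIMA(p, d, q) model:

$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) A_t$

Here $L$ is the lag operator, $L^i X_t = X_{t-i}$. In this last form the moving-average coefficients enter with a plus sign, so they are the negatives of the $\theta_i$ in the MA(q) equation.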

Page 30: STATA Training Session 3

Time Series Analysis & Forecasting

Example

Step 1: Identification of orders of ARIMA model

SHAPE                                                 INDICATED MODEL
Exponential, decaying to zero                         Autoregressive model. Use the partial autocorrelation plot to identify the order of the autoregressive model.
Alternating positive and negative, decaying to zero   Autoregressive model. Use the partial autocorrelation plot to help identify the order.
One or more spikes, rest are essentially zero         Moving average model, order identified by where plot becomes zero.
Decay, starting after a few lags                      Mixed autoregressive and moving average model.
All zero or close to zero                             Data is essentially random.
High values at fixed intervals                        Include seasonal autoregressive term.
No decay to zero                                      Series is not stationary.

Page 31: STATA Training Session 3

Time Series Analysis & Forecasting

scatter men mdate, c(l)

ac men, lags(24)

[Figure: men against mdate, 1989m1 to 1998m1]

[Figure: Autocorrelations of men against Lag (0 to 25); Bartlett's formula for MA(q) 95% confidence bands]

Page 32: STATA Training Session 3

Time Series Analysis & Forecasting

scatter ds12.men mdate, c(l)

ac s12.men, lags(24)

pac s12.men, lags(24)

[Figure: DS12.men against mdate, 1989m1 to 1998m1]

[Figure: Autocorrelations of S12.men against Lag; Bartlett's formula for MA(q) 95% confidence bands]

[Figure: Partial autocorrelations of S12.men against Lag; 95% confidence bands [se = 1/sqrt(n)]]

Page 33: STATA Training Session 3

Time Series Analysis & Forecasting

We have an ARIMA(0,0,0)(0,0,1,12) model.

Step 2: Estimation & Diagnostics

arima men, arima(0,0,0) sarima(0,0,1,12) noconstant
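Written out with the lag operator from the ARIMA(p, d, q) form, and assuming Stata's convention that moving-average terms enter with a plus sign, this specification has no AR part, no differencing, no nonseasonal MA part and no constant. Only one seasonal MA term at lag 12 remains. The symbol $\Theta_1$ for that coefficient is introduced here for illustration, and $X_t$ stands for men:

```latex
X_t = \left(1 + \Theta_1 L^{12}\right) A_t = A_t + \Theta_1 A_{t-12}
```

In other words, each month's value is modelled as the current shock plus a multiple of the shock from the same month one year earlier.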

Page 34: STATA Training Session 3

Time Series Analysis & Forecasting

arima men mail page phone print service, arima(0,0,0) sarima(0,0,1,12) noconstant

Page 35: STATA Training Session 3

Time Series Analysis & Forecasting

predict res, res

corrgram res, lags(36)

Page 36: STATA Training Session 3

Time Series Analysis & Forecasting

Step 3: Prediction

predict fit, xb

line fit men mdate

[Figure: xb prediction, one-step, and men against mdate, 1989m1 to 1998m1]

Page 37: STATA Training Session 3

Time Series Analysis & Forecasting

arima men mail page phone print service if tin(, 1990m1), arima(0,0,0) sarima(0,0,1,12) noconstant

predict fit1, xb

predict fit2, xb dyn(m(1990m1))

line fit1 fit2 men mdate

[Figure: xb prediction, one-step; xb prediction, dyn(m(1990m1)); and men against mdate, 1989m1 to 1999m1]

Page 38: STATA Training Session 3

Introduction to Panel Regression with STATA

Panel Regression Model (Linear)

$y_{it} = \alpha + x_{it}\beta + \mu_i + v_{it}$

$\mu_i$ is the fixed or random effect, $v_{it}$ is the overall residual.

• Panel data: also called cross-sectional time series data, with multiple cases (people, nations, firms, etc.) observed for two or more time periods.
• Cross-sectional information: differences between subjects (between-subject effects).
• Time series information: changes within subjects over time (within-subject effects).

Page 39: STATA Training Session 3

Introduction to Panel Regression with STATA

              Variables
Cases (nt)    x1   x2   x3   xj
11            .    .    .    .
12            .    .    .    .
…             .    .    .    .
1t            .    .    .    .
21            .    .    .    .
22            .    .    .    .
…             .    .    .    .
2t            .    .    .    .
31            .    .    .    .
32            .    .    .    .
…             .    .    .    .
3t            .    .    .    .
…             .    .    .    .
nt            .    .    .    .

Page 40: STATA Training Session 3

Introduction to Panel Regression with STATA

Fixed, Between and Random Effect Models (Linear)

• Fixed effects regression is the model to use when you want to control for omitted variables that differ between cases but are constant over time.

STATA command xtreg with the fe option

• Regression with between effects is the model to use when you want to control for omitted variables that change over time but are constant between cases.

STATA command xtreg with the be option

• Random effects model: if some omitted variables may be constant over time but vary between cases, and others may be fixed between cases but vary over time, you can include both types by using random effects.

STATA command xtreg with the re option

Page 41: STATA Training Session 3

Introduction to Panel Regression with STATA

Example

Data: airfare.dta

Variable name   Variable information
year            1997, 1998, 1999, 2000
origin          Flight’s origin
destin          Flight’s destination
id              Route identifier
dist            Distance, in miles
passen          Avg. passengers per day
fare            Avg. one-way fare, $
bmktshr         Fraction of market, biggest carrier

Page 42: STATA Training Session 3

Introduction to Panel Regression with STATA

reg lfare ldistsq lpassen bmktshr

Exercise 2:

Keep the observations for id=1 and id=2 only. Run a separate regression for each flight route. Draw scatter plots to observe how the air fare changes over the years for these two routes.

Page 43: STATA Training Session 3

Introduction to Panel Regression with STATA

-define dataset as panel data

tsset id year

-summarize panel data

xtsum lfare ldistsq lpassen bmktshr

Page 44: STATA Training Session 3

Introduction to Panel Regression with STATA

Fixed effects: -regression with fixed effects command

xtreg lfare ldistsq lpassen bmktshr, fe

The fixed effects model answers: what is the effect of x when x changes within routes over time?
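The within-routes interpretation can be checked directly: the fixed-effects slope equals the OLS slope obtained after subtracting each unit's own mean from y and x. A minimal sketch on a made-up three-unit panel follows; all variable names and values are invented for illustration and this is not airfare.dta.

```stata
* Toy panel: 3 units observed for 3 periods (values are made up)
clear
input id year y x
1 1 1.0 1
1 2 2.1 2
1 3 2.9 3
2 1 5.2 2
2 2 6.1 3
2 3 6.8 4
3 1 3.1 1
3 2 4.9 3
3 3 6.2 4
end
tsset id year

* Fixed-effects (within) estimator
xtreg y x, fe
scalar b_fe = _b[x]

* The same slope by hand: demean y and x within each id,
* then run OLS without a constant
egen double ybar = mean(y), by(id)
egen double xbar = mean(x), by(id)
gen double yd = y - ybar
gen double xd = x - xbar
regress yd xd, noconstant

* The two slopes agree; the standard errors differ because the
* hand-made regression does not count the unit means as estimated parameters
display b_fe "  " _b[xd]
```

Any variable that is constant within a unit is wiped out by the demeaning, which is why a time-invariant regressor cannot be estimated with the fe option.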

Page 45: STATA Training Session 3

Introduction to Panel Regression with STATA

Between effects: -regression with between effect command

xtreg lfare ldistsq lpassen bmktshr, be

The between effects model answers: what is the effect of x when x is different between routes?

Page 46: STATA Training Session 3

Introduction to Panel Regression with STATA

Random effects: -regression with random effect command

$(y_{it} - \theta \bar{y}_i) = (1-\theta)\alpha + (x_{it} - \theta \bar{x}_i)\beta + \{(1-\theta)\mu_i + (v_{it} - \theta \bar{v}_i)\}$

θ is a function of the variances of $\mu_i$ and $v_{it}$.

The random effects model answers:

1. What is the effect of x when x changes within routes over time?

2. What is the effect of x when x is different between routes?
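For reference, in a balanced panel with T periods per route the weight θ has a standard closed form. It is stated here from the usual textbook treatment, with $\sigma_\mu^2$ and $\sigma_v^2$ denoting the variances of $\mu_i$ and $v_{it}$; these two symbols do not appear on the slide.

```latex
\theta = 1 - \sqrt{\frac{\sigma_v^2}{T\,\sigma_\mu^2 + \sigma_v^2}}
```

When $\sigma_\mu^2 = 0$, θ is 0 and the random effects estimator reduces to pooled OLS. As $T\sigma_\mu^2$ grows large relative to $\sigma_v^2$, θ approaches 1 and the estimator approaches the fixed effects estimator.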

Page 47: STATA Training Session 3

Introduction to Panel Regression with STATA

xtreg lfare ldistsq lpassen bmktshr, re theta

Page 48: STATA Training Session 3

Introduction to Panel Regression with STATA

                 Source of Variation                 Coefficients
                 Btw-Variation   Within-Variation    ldistsq   lpassen   bmktshr
Fixed Effects    No              Yes                 /         -0.316    0.0647
Btw Effects      Yes             No                  0.034     -0.066    0.316
Random Effects   Yes             Yes                 0.029     -0.2235   0.096

Question: what’s the right model, fixed or random effects?

Page 49: STATA Training Session 3

Introduction to Panel Regression with STATA

Hausman Test: H0: coefficients estimated by the RE estimator are the same as those estimated by the FE estimator.

xtreg lfare ldistsq lpassen bmktshr, fe

estimates store fixed

xtreg lfare ldistsq lpassen bmktshr, re

estimates store random

hausman fixed random

Page 50: STATA Training Session 3

Introduction to Panel Regression with STATA

Breusch and Pagan LM test:

H0: sd(ui) = 0, where sd(ui) is the standard deviation of the ui terms

xttest0

Page 51: STATA Training Session 3

Introduction to Panel Regression with STATA

Test for serial correlation:

findit xtserial

xtserial lfare ldistsq lpassen bmktshr, output

Page 52: STATA Training Session 3

Introduction to Panel Regression with STATA

Exercise 3

1. To correct for autocorrelation, we fit an FE model with AR(1) disturbances. Run such a model in STATA using the command xtregar.

2. In this data, we suspect, based on the results from Exercise 2, that there is a period effect, i.e., after 1997 the airfare increased on every flight route. Such a ‘systematic’ shock introduces endogeneity, and the FE estimator would be biased. An intuitive way to solve this problem is to create dummy variables for the years t>1997.
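One possible starting point for Exercise 3, offered as a hedged sketch rather than the official solution. It assumes airfare.dta from the course materials is in the working directory and contains the variables used on the earlier slides; the choice of i.year with xi to build the year dummies is an assumption about how to implement part 2.

```stata
* Sketch for Exercise 3 (assumes airfare.dta is in the working directory)
use airfare.dta, clear
tsset id year

* 1. Fixed-effects model with AR(1) disturbances
xtregar lfare ldistsq lpassen bmktshr, fe

* 2. Fixed-effects model with year dummies; xi creates _Iyear_* variables
*    and omits the first year (1997) as the reference category
xi: xtreg lfare ldistsq lpassen bmktshr i.year, fe

* Joint test that the year dummies are all zero
testparm _Iyear_*
```

A significant joint test would be consistent with a common period effect across routes, which is what the exercise asks you to investigate.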

Page 53: STATA Training Session 3

Thanks!