STATA Training Session 3
Advanced Topics in STATA
Sun Li, Centre for Academic Computing
Outline
• Resources and Books
• Survival Analysis
  • Kaplan-Meier Estimator
  • Cox Regression
• Time Series and Forecasting
  • Exponential Smoothing
  • ARIMA Models
• Introduction to Panel Regression with STATA
Resources And Books
CAC Computing Resources for STATA users
• Windows:
  • STATA/SE version 10.0
  • 10-user network perpetual license
  • Installation guide (http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA-Software Questions.aspx)
• Linux (CAC Beowulf Cluster):
  • STATA/SE version 10.0
  • Unlimited users
  • About CAC Beowulf Cluster: (http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)
• New features in STATA 10.0 (http://www.stata.com/stata10)
• Website resources:
  • The STATA website: http://www.stata.com
  • The STATA journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/
  • STATA FAQ: http://www.stata.com/support/faqs
  • STATA User Support: http://www.stata.com/support
  • Books: http://www.stata.com/bookstore/
• CAC STATA support:
  • Website: http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
  • Contact:
    • For statistical consultation: Sun Li: [email protected]
    • For software installation: TAN Suh Wen: [email protected]
• Additional recommended readings:
  • Econometric Analysis of Cross Section and Panel Data, Jeffrey M. Wooldridge
  • An Introduction to Modern Econometrics Using Stata, Christopher F. Baum
  • New Introduction to Multiple Time Series Analysis, Helmut Lütkepohl
  • Applied Survival Analysis: Regression Modeling of Time to Event Data, 2nd Edition, David W. Hosmer, Jr., Stanley Lemeshow, and Susanne May
  • An Introduction to Survival Analysis Using Stata, Revised Edition, Mario Cleves, William W. Gould, and Roberto G. Gutierrez
Download training slides, data and syntax:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx
Survival Analysis
Survival Data
• Data: survival data are time-to-event data, that is, quantitative data measuring the time from a well-defined origin until the occurrence of a particular event of interest (the endpoint).
• Reasons for using a survival model:
  • The distribution of survival times tends to be positively skewed and is unlikely to be normal, and it may not be possible to find a normalizing transformation.
  • Standard regression models cannot handle time-varying covariates.
  • Some durations are censored.
• Censored observations: the event has not occurred by the end of the study; the subject is lost to follow-up, withdraws from the study, or is offered another intervention; or the event occurred but from an unrelated cause.
Survival Model
Survival function: $S(t) = P(T \ge t) = 1 - F(t)$
Hazard function: $h(t) = \dfrac{f(t)}{S(t)} = -\dfrac{d \log S(t)}{dt} \;\Rightarrow\; S(t) = \exp(-H(t))$
$H(t)$ is the cumulative hazard function.
Kaplan-Meier Estimator:
$\hat{S}(t) = \prod_{j \,:\, t_{(j)} \le t} \left(1 - \dfrac{d_j}{n_j}\right)$
$d_j$: the number of individuals who experience the event at time $t_{(j)}$
$n_j$: the number of individuals who have not yet experienced the event at time $t_{(j)}$
$t_{(1)} < t_{(2)} < \dots < t_{(n)}$ are the ordered event times.
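As a worked illustration of the Kaplan-Meier product formula above, the following Python sketch computes the estimate by hand. The function name `kaplan_meier` and the toy durations are hypothetical and are not taken from the training data; in STATA the same table is produced by `sts list` after `stset`.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t_j, S_hat(t_j))] at each distinct event time.

    times  : observed durations (event time or censoring time)
    events : 1 if the event occurred, 0 if the record is censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    surv = 1.0
    curve = []
    for t in sorted(deaths):
        n_j = sum(1 for u in times if u >= t)  # not yet failed or censored before t
        d_j = deaths[t]                        # events observed at t
        surv *= 1.0 - d_j / n_j
        curve.append((t, surv))
    return curve

# Hypothetical toy data: five subjects, two of them censored.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
km = kaplan_meier(toy_times, toy_events)
for t, s in km:
    print(t, round(s, 4))
```

Censored records never trigger a drop in the curve, but they stay in the risk set n_j until their censoring time.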
Cox Regression:
$h_i(t) = h_0(t)\exp(x_i^T\beta) \;\Rightarrow\; S_i(t) = S_0(t)^{\exp(x_i^T\beta)} \;\Rightarrow\; \log(H_i(t)) = \log(H_0(t)) + x_i^T\beta$
$h_0(t)$ is the baseline hazard function.
$\exp(\beta^T(x_i - x_j))$ is the hazard ratio (HR) or incidence rate ratio.
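The hazard ratio above can be evaluated directly once coefficients are available. The sketch below is only an illustration: `hazard_ratio` is a hypothetical helper and the coefficient values are made up, not estimates from telco.csv.

```python
import math

def hazard_ratio(beta, x_i, x_j):
    """exp(beta' (x_i - x_j)): hazard of subject i relative to subject j."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_i, x_j)))

# Hypothetical coefficients for two covariates (assumed values).
beta = [-0.05, 0.30]

# Subject i has 10 more units of the first covariate; the second is equal.
hr = hazard_ratio(beta, x_i=[10, 1], x_j=[0, 1])
print(round(hr, 4))
```

Because the baseline hazard cancels in the ratio, the result does not depend on t; this is the proportional hazards property.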
Survival Analysis in STATA
Data: telco.csv
Variable name   Variable information
age             Age in years
marital         Marital status: 0 = unmarried, 1 = married
address         Years at current address
income          Household income in thousands
ed              Level of education: 1 = didn’t complete high school, 2 = high school degree, 3 = college degree, 4 = undergraduate, 5 = postgraduate
employ          Years with current employer
reside          Number of people in household
gender          Gender: 0 = male, 1 = female
tenure          Months with service
churn           Churn within last month: 0 = ‘No’, 1 = ‘Yes’
custcat         Customer category: 1 = basic service, 2 = E-service, 3 = plus service, 4 = total service
Declaring and summarizing survival-time data:
insheet using telco.csv
d
stset tenure, failure(churn)
_st: 1 if the record is to be used, 0 if ignored;
_d: 1 if failure, 0 if censored;
_t: analysis time when record ends;
_t0: analysis time when record begins.
stsum
ltable tenure churn
sts graph, by(custcat)
[Figure: Kaplan-Meier survival estimates by custcat (1 to 4); x-axis: analysis time (0 to 80), y-axis: survival probability (0.00 to 1.00)]
Fitting regression models:
sw stcox age marital address income ed employ retire, pe(0.05)
xi : stcox employ address marital income i.custcat
test _Icustcat_2 _Icustcat_3 _Icustcat_4
char marital[omit] 1
char custcat[omit] 4
xi:stcox employ address i.marital income i.custcat, basesurv(s) basehc(h)
stcurve, survival
stcurve, hazard
stcurve, survival at1( _Icustcat_1=1) at2( _Icustcat_2=1) at3( _Icustcat_3=1)
stcurve, hazard at1( _Icustcat_1=1) at2( _Icustcat_2=1) at3( _Icustcat_3=1)
[Figure: Cox proportional hazards regression, survival curves for _Icustcat_1=1, _Icustcat_2=1 and _Icustcat_3=1; x-axis: analysis time (0 to 80), y-axis: Survival (.5 to 1)]
Examining the proportional hazards assumption:
stphplot, by(custcat)
The proportional-hazards assumption is not violated when the curves are parallel.
[Figure: -ln[-ln(Survival Probability)] against ln(analysis time), one curve per custcat (1 to 4)]
Examining time-varying covariates:
xi : stcox employ address i.marital income i.custcat, tvc(employ)
estimates store model1
xi : quietly stcox employ address i.marital income i.custcat
lrtest model1 .
Exercise 1
Repeat the above analysis, treating customer category as a stratifying variable instead of a covariate.
Time Series Analysis & Forecasting
Definitions, Applications and Techniques
Time series data: each case represents a point in time, and each cell gives the value of a variable for that time period.
• Stationarity: a stationary process has the property that its mean, variance and autocorrelation structure do not change over time.
• Seasonality: by seasonality, we mean periodic fluctuations.
Time series models are used:
• to obtain an understanding of the underlying forces and structures that produce the observed data;
• to fit a model and proceed to forecasting and monitoring.
Techniques:
• Exponential Smoothing
• ARIMA Models
Exponential Smoothing
Four available model types:
� Simple. The simple model assumes that the series has no trend and no seasonal variation.
� Holt. The Holt model assumes that the series has a linear trend and no seasonal variation.
� Winters. The Winters model assumes that the series has a linear trend and multiplicative seasonal variation (its magnitude increases or decreases with the overall level of the series).
� Custom. A custom model allows you to specify the trend and seasonality components.
General form of models:
• Single Exponential Smoothing:
$S_t = \alpha \sum_{i=1}^{t-2} (1-\alpha)^{i-1} y_{t-i} + (1-\alpha)^{t-2} S_2, \quad t \ge 2$
• Double Exponential Smoothing:
$S_t = \alpha y_t + (1-\alpha)(S_{t-1} + b_{t-1}), \quad 0 \le \alpha \le 1$
$b_t = \gamma (S_t - S_{t-1}) + (1-\gamma) b_{t-1}, \quad 0 \le \gamma \le 1$
• Triple Exponential Smoothing:
$S_t = \alpha \dfrac{y_t}{I_{t-L}} + (1-\alpha)(S_{t-1} + b_{t-1})$  (overall smoothing)
$b_t = \gamma (S_t - S_{t-1}) + (1-\gamma) b_{t-1}$  (trend smoothing)
$I_t = \beta \dfrac{y_t}{S_t} + (1-\beta) I_{t-L}$  (seasonal smoothing)
$F_{t+m} = (S_t + m b_t)\, I_{t-L+m}$  (forecast)
where $I_t$ is the seasonal index and $L$ is the number of periods in a season.
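The double exponential smoothing recursions can be traced with a few lines of Python. This is a minimal sketch, not the algorithm STATA's `tssmooth` uses internally: the function names are hypothetical, and the starting values S_1 = y_1 and b_1 = y_2 - y_1 are an assumed convention (the slides do not specify one).

```python
def holt_smooth(y, alpha, gamma):
    """Double exponential smoothing.

    S_t = alpha*y_t + (1-alpha)*(S_{t-1} + b_{t-1})
    b_t = gamma*(S_t - S_{t-1}) + (1-gamma)*b_{t-1}
    Starting values S_1 = y_1, b_1 = y_2 - y_1 are an assumption.
    """
    level, trend = y[0], y[1] - y[0]
    levels, trends = [level], [trend]
    for obs in y[1:]:
        prev = level
        level = alpha * obs + (1 - alpha) * (prev + trend)
        trend = gamma * (level - prev) + (1 - gamma) * trend
        levels.append(level)
        trends.append(trend)
    return levels, trends

def holt_forecast(levels, trends, m):
    """m-step-ahead forecast from the last smoothed level and trend."""
    return levels[-1] + m * trends[-1]

# A perfectly linear hypothetical series is tracked without error.
series = [10.0, 12.0, 14.0, 16.0, 18.0]
levels, trends = holt_smooth(series, alpha=0.5, gamma=0.5)
print(levels[-1], trends[-1], holt_forecast(levels, trends, 2))
```

The triple (Holt-Winters) version adds the seasonal index to the same loop: the observation is divided by the index from one season earlier before it enters the level equation.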
Example
Data: seasfac.csv
Variable name          Variable information
date                   Date
men                    Sales of men’s clothing
mail                   Number of catalogs mailed
page                   Number of pages in catalogs
phone                  Number of phone lines open for ordering
print                  Amount spent on print advertising
seasonal_factors_men   Seasonal factors for sales of men's clothing
year_                  Year of the date
month_                 Month of the date
Step 1: Understand your data
insheet using seasfac.csv
d
gen mdate=ym(year_, month_)
scatter men mdate, c(l) sort xlabel(, grid) ylabel(,grid)
[Figure: men against mdate, 1989m1 to 1998m1; y-axis: men (0 to 40000)]
Step 2: Declare time series data & test stationarity assumption
tsset mdate, monthly
list men d.men l.men in 1/10
dfuller men, regress trend
Step 3: Detect seasonality with autocorrelations
reg d.men l.men
predict res1,r
corrgram res1
wntestq res1
Both the autocorrelation and partial autocorrelation charts show the annual seasonal structure of the time series.
Q-test for white noise: if the test is significant, the residuals are autocorrelated, i.e. they are not white noise.
Step 4: Construct Holt-Winters seasonal smoothing model
• Single-exponential smoothing
tssmooth exponential men1=men
• Holt-Winters nonseasonal smoothing
tssmooth hwinters men2=men, from(.1 .1) iterate(100)
• Holt-Winters seasonal smoothing
tssmooth shwinters men3=men, sn0_0(seasonal_factors_men) from(.1 .1 .1) iterate(100)
line men1 men2 men3 men mdate
[Figure: men and the three smoothed series against mdate, 1988m1 to 2000m1; legend: parms(0.1057) = men, hw parms(0.000 0.000) = men, shw parms(0.013 0.138 0.000) = men, men]
Step 5: Predictions
line men3 men mdate
line men3 men mdate if mdate>467
[Figure: two panels of shw parms(0.013 0.138 0.000) = men and men against mdate: the full series, 1988m1 to 2000m1, and the period 1999m1 to 2000m1]
ARIMA Model: ARIMA(p, d, q)
• Autoregression (AR): p is the order of autoregression
• Integration (I): d is the order of integration (differencing)
• Moving-Average (MA): q is the order of moving-average
AR(p) model:
$X_t = \delta + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + A_t$
MA(q) model:
$X_t = \mu + A_t - \theta_1 A_{t-1} - \theta_2 A_{t-2} - \dots - \theta_q A_{t-q}$
ARIMA(p, d, q) model:
$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d X_t = \left(1 - \sum_{i=1}^{q} \theta_i L^i\right) A_t$
where $L$ is the lag operator ($L X_t = X_{t-1}$) and the $\theta_i$ follow the sign convention of the MA(q) model.
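The integration part of the model, $(1 - L)^d$, is plain differencing, and a seasonal difference replaces $L$ by $L^s$. The following Python sketch shows both on a made-up series; `difference` is a hypothetical helper that plays the role of STATA's `d.` operator used earlier (`d.men`) and of a seasonal difference.

```python
def difference(x, lag=1):
    """Apply (1 - L^lag): return x_t - x_{t-lag} for t >= lag."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

# Hypothetical series: linear trend plus a repeating pattern of period 4.
pattern = [5, -2, 0, -3]
series = [2 * t + pattern[t % 4] for t in range(12)]

seasonal_diff = difference(series, lag=4)   # removes the period-4 pattern
both = difference(seasonal_diff, lag=1)     # then removes what is left of the trend
print(seasonal_diff)
print(both)
```

Each difference shortens the series by `lag` observations, which is why differenced models lose their first few data points.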
Example
Step 1: Identification of orders of ARIMA model
SHAPE | INDICATED MODEL
Exponential, decaying to zero | Autoregressive model. Use the partial autocorrelation plot to identify the order of the autoregressive model.
Alternating positive and negative, decaying to zero | Autoregressive model. Use the partial autocorrelation plot to help identify the order.
One or more spikes, rest are essentially zero | Moving average model, order identified by where plot becomes zero.
Decay, starting after a few lags | Mixed autoregressive and moving average model.
All zero or close to zero | Data is essentially random.
High values at fixed intervals | Include seasonal autoregressive term.
No decay to zero | Series is not stationary.
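The shapes in the table refer to the sample autocorrelations that `ac` plots. As a rough sketch of what is being plotted, the Python function below computes them for a made-up series with period 4; `sample_acf` is a hypothetical name, and the divisor (the lag-0 sum of squares) is the usual textbook choice rather than a statement about STATA's internals.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of the series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

# Hypothetical series with a period of 4.
seasonal = [1.0, 0.0, -1.0, 0.0] * 6
acf = sample_acf(seasonal, 8)
print([round(r, 3) for r in acf])
```

For this series the large positive values fall at lags 4 and 8, which is the "high values at fixed intervals" row of the table.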
scatter men mdate, c(l)
ac men, lags(24)
[Figure: men against mdate, 1989m1 to 1998m1]
[Figure: Autocorrelations of men, lags 0 to 25, with Bartlett's formula for MA(q) 95% confidence bands]
scatter ds12.men mdate, c(l)
ac s12.men, lags(24)
pac s12.men, lags(24)
[Figure: DS12.men against mdate, 1989m1 to 1998m1]
[Figure: Autocorrelations of S12.men, lags 0 to 25, with Bartlett's formula for MA(q) 95% confidence bands]
[Figure: Partial autocorrelations of S12.men, lags 0 to 50, with 95% confidence bands (se = 1/sqrt(n))]
We have an ARIMA(0,0,0)(0,0,1,12) model.
Step 2: Estimation & Diagnostics
arima men, arima(0,0,0) sarima(0,0,1,12) noconstant
arima men mail page phone print service, arima(0,0,0) sarima(0,0,1,12) noconstant
predict res, res
corrgram res, lags(36)
Step 3: Prediction
predict fit, xb
line fit men mdate
[Figure: xb prediction, one-step and men against mdate, 1989m1 to 1998m1]
arima men mail page phone print service if tin(, 1990m1), arima(0,0,0) sarima(0,0,1,12) noconstant
predict fit1,xb
predict fit2,xb dyn(m(1990m1))
line fit1 fit2 men mdate
[Figure: xb prediction, one-step; xb prediction, dyn(m(1990m1)); and men against mdate, 1989m1 to 1999m1]
Introduction to Panel Regression with STATA
Panel Regression Model (Linear)
$y_{it} = \alpha + \mathbf{x}_{it}\boldsymbol{\beta} + \mu_i + v_{it}$
$\mu_i$ is the fixed or random effect, $v_{it}$ is the overall residual.
• Panel data: also called cross-sectional time series data; multiple cases (people, nations, firms, etc.) observed for two or more time periods.
• Cross-sectional information: differences between subjects (between-subject effects).
• Time series information: changes within subjects over time (within-subject effects).
Variables
Cases (nt)   x1   x2   x3   ...   xj
11           .    .    .          .
12           .    .    .          .
…            .    .    .          .
1t           .    .    .          .
21           .    .    .          .
22           .    .    .          .
…            .    .    .          .
2t           .    .    .          .
31           .    .    .          .
32           .    .    .          .
…            .    .    .          .
3t           .    .    .          .
…            .    .    .          .
nt           .    .    .          .
Fixed, Between and Random Effect Models (Linear)
• Fixed effects regression is the model to use when you want to control for omitted variables that differ between cases but are constant over time.
STATA command xtreg with the fe option
• Regression with between effects is the model to use when you want to control for omitted variables that change over time but are constant between cases.
STATA command xtreg with the be option
• Random effects model: if some omitted variables are constant over time but vary between cases, and others are fixed between cases but vary over time, you can include both types by using random effects.
STATA command xtreg with the re option
Example
Data: airfare.dta
Variable name   Variable information
year            1997, 1998, 1999, 2000
origin          Flight’s origin
destin          Flight’s destination
id              Route identifier
dist            Distance, in miles
passen          Avg. passengers per day
fare            Avg. one-way fare, $
bmktshr         Fraction market, biggest carrier
reg lfare ldistsq lpassen bmktshr
Exercise 2:
Keep the observations for id=1 and id=2 only. Run a separate regression for each flight route. Draw scatter plots to observe how the air fare changes over the years for these two routes.
-define dataset as panel data
tsset id year
-summarize panel data
xtsum lfare ldistsq lpassen bmktshr
Fixed effects: -regression with fixed effects command
xtreg lfare ldistsq lpassen bmktshr, fe
The fixed effects model answers: what is the effect of x when x changes within routes over time?
Between effects: -regression with between effects command
xtreg lfare ldistsq lpassen bmktshr, be
The between effects model answers: what is the effect of x when x differs between routes?
Random effects: -regression with random effects command
$(y_{it} - \theta\bar{y}_i) = (1-\theta)\alpha + (\mathbf{x}_{it} - \theta\bar{\mathbf{x}}_i)\boldsymbol{\beta} + \{(1-\theta)\mu_i + (v_{it} - \theta\bar{v}_i)\}$
$\theta$ is a function of the variances of $\mu_i$ and $v_{it}$.
The random effects model answers:
1. what is the effect of x when x changes within routes over time?
2. what is the effect of x when x differs between routes?
xtreg lfare ldistsq lpassen bmktshr, re theta
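The role of θ in the random effects transformation can be seen at its two limits: θ = 0 leaves the data untouched (pooled OLS) and θ = 1 subtracts the full route mean (the fixed effects, or within, estimator). The Python sketch below illustrates this on a made-up two-route panel with a single regressor; the helper names and all numbers are hypothetical and unrelated to airfare.dta.

```python
def quasi_demean(values, ids, theta):
    """Return v_it - theta * (mean of v within case i) for each observation."""
    totals, counts = {}, {}
    for v, i in zip(values, ids):
        totals[i] = totals.get(i, 0.0) + v
        counts[i] = counts.get(i, 0) + 1
    return [v - theta * totals[i] / counts[i] for v, i in zip(values, ids)]

def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical panel: within each route y rises by 2 per unit of x,
# but route 2 has a much lower level (a route effect correlated with x).
ids = [1, 1, 1, 2, 2, 2]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [12.0, 14.0, 16.0, 2.0, 4.0, 6.0]

pooled = slope(quasi_demean(x, ids, 0.0), quasi_demean(y, ids, 0.0))  # theta = 0
within = slope(quasi_demean(x, ids, 1.0), quasi_demean(y, ids, 1.0))  # theta = 1
print(round(pooled, 4), round(within, 4))
```

In this toy panel the pooled slope even has the wrong sign, because the route effect is correlated with x; the random effects estimate uses a θ between 0 and 1 and so lies between these two cases.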
                 Source of variation               Coefficients
                 Btw-variation  Within-variation   ldistsq  lpassen  bmktshr
Fixed Effects    No             Yes                /        -0.316   0.0647
Btw Effects      Yes            No                 0.034    -0.066   0.316
Random Effects   Yes            Yes                0.029    -0.2235  0.096
Question: what’s the right model, fixed or random effects?
Hausman Test: H0: coefficients estimated by RE estimator are the same as those estimated by FE estimator.
xtreg lfare ldistsq lpassen bmktshr, fe
estimates store fixed
xtreg lfare ldistsq lpassen bmktshr, re
estimates store random
hausman fixed random
Breusch and Pagan LM test:
H0: sd(ui) = 0, where sd(ui) is the standard deviation of the ui terms
xttest0
Serial test of autocorrelation:
findit xtserial
xtserial lfare ldistsq lpassen bmktshr, output
Exercise 3
1. To correct for autocorrelation, fit an FE model with AR(1) disturbances. Run this model in STATA using the command xtregar.
2. In this data we suspect, based on the results from Exercise 2, that there is a period effect, i.e., after 1997 airfares increase on every flight route. Such a ‘systematic’ shock introduces endogeneity, and the FE estimator would be biased. An intuitive way to solve this problem is to create dummy variables for the years t>1997.
Thanks!