Survival Analysis Workflow: Assessing the Impact of Macro-Economic Shocks on Credit Portfolios and Predicting the Time of Default
By Anupam Saha, Software Specialist – Analytics, SAS R&D (India) Pvt. Ltd., and Naeem Siddiqi, Senior Solutions Specialist, SAS Institute Inc.
Introduction
The recent global credit crisis has reiterated the need for bankers to focus on macro-economic factors (such as inflation, house price index, and so on) that severely impact the default behavior of credit portfolios. At the micro level, the response to these economic changes varies with individual capacity and tolerance. For example, although accounts within the same credit score or rating group show homogeneous behavior in default performance, individual borrowers can respond to macro-economic shocks differently from others in the same segment. This results in default distributions that are scattered across the predefined performance window. Because defaults are scattered across time, assessing the macro-economic impact on a credit portfolio is a challenging task.
In credit scoring, the traditional probability-of-default (PD) model is used to estimate the individual likelihood of default. This involves generating a sample of new or existing accounts at one point in time and monitoring the performance of those accounts across a specified performance window to determine their default or non-default status. It is generally assumed that the probability of default is constant through the entire performance window, and the outcome of the model is a single PD value that is valid for the full window. In practice, because the defaults are scattered across time in the performance window, PD varies with respect to time within the same window. Along with the PD, the time of default can be a key piece of information for bankers, allowing them to create better account management strategies. The traditional methodology makes it difficult to calculate the PD at different time points, or the time of default, within the specified performance window.
In this paper, we propose a survival analysis workflow for two purposes:
• to observe the impact of macro-economic shocks on default behavior of credit exposures
• to predict the time of default
We intend to use traditional PD models in conjunction with a survival analysis technique (explained later in this paper) to overcome the challenges associated with traditional PD models. In this paper, we explain a survival model called the “Cox Proportional Hazard Model” that can be used to establish a relationship between macro-economic variables and the hazard function (that is, the instantaneous rate of default at time t, given survival up to time t) at different time points in an observation window. The survival distribution of default can be estimated by this model. The survival function is used to determine the time of default. Further analysis is then performed to obtain the expected time-dependent default count distribution per segment. Finally, changes in macro-economic factors are used to stress-test the credit portfolio.
This paper also gives insights into the data requirements, SAS procedures for survival analysis, and the techniques to validate and monitor the workflow for accuracy and consistency.
Benefits of Using Survival Analysis in Credit Scoring
1. Modelers in banks are aware of the fact that macro-economic variables affect probabilities of default that are generated using models. These macro-economic variables can change at any time during the observation period. However, such variables, and in particular, the leading indicators, have not been included in the traditional logistic regression models used in banking. The impact of macro-economic variables on the PD value can be gauged by considering those variables as time-dependent covariates in the survival analysis model.
The survival function of default as well as the time of default can be derived from that survival model. Time of default here refers to the survival time of a particular account. Survival time is the time from when the account was opened to the date of default. If the account does not default by the end of the observation period, that observation can be considered as a censored event.
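For analyzing the survival data, the hazard function gives the instantaneous rate of default at time t, given that the account has survived up to time t. With T denoting the time of default:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$
The probability of survival at time t can be given in terms of the hazard function:
$$S(t) = \exp\left\{-\int_0^t h(u)\,du\right\}$$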
In a credit-scoring scenario, the survival probability S(t) is the probability that an account has not defaulted by time t after the account was opened (that is, 1 − PD at time t). If interval censoring is considered, then S(t) is the probability that an account has not defaulted by time t after the start of the observation period.
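The relationship between the hazard function and the survival probability can be checked numerically. The following Python sketch is only an illustration: it assumes the hazard is constant within each period, and the function name and the monthly hazard values are hypothetical, not taken from the paper's data.

```python
import math

def survival_from_hazard(hazards, dt=1.0):
    """Return S(t) at the end of each period.

    hazards[j] is the hazard rate assumed constant during period j,
    and dt is the length of each period.
    """
    survival = []
    cumulative_hazard = 0.0
    for h in hazards:
        cumulative_hazard += h * dt  # running integral of h(u) du
        survival.append(math.exp(-cumulative_hazard))
    return survival

# Hypothetical monthly hazard rates for one account (illustrative values only).
monthly_hazards = [0.01, 0.02, 0.03]
surv = survival_from_hazard(monthly_hazards)
pd_by_time = [1.0 - s for s in surv]  # cumulative PD at each time point
```

Here `pd_by_time` is the quantity 1 − S(t) described above, so it rises as the survival probability falls.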
There are several ways to generate the hazard function. We will use the Cox Proportional Hazard (Cox PH) model because it allows the inclusion of macro-economic variables as time-dependent covariates in the model. Under the Cox PH model,
$$h_i(t) = \lambda_0(t)\,\exp\left(\beta_1 X_{i1}(t) + \beta_2 X_{i2}(t) + \dots + \beta_k X_{ik}(t)\right)$$
where,
i = 1, 2, ..., n is the subscript for the observation.
$h_i(t)$ is the hazard of the ith individual at time t.
$\lambda_0(t)$ is the baseline hazard function.
$X_{i1}(t), \dots, X_{ik}(t)$ are the k time-dependent covariates of the ith individual at time t.
$\beta_1, \dots, \beta_k$ are the regression parameters associated with the covariates, estimated by the partial likelihood method in the Cox model.
To apply the Cox PH model in the credit scoring environment, macro-economic variables should be captured at the account default date, prior to the default date (any lag period can be considered), at the account open date, and at regular time intervals during the observation period. The change in macro-economic values between the account open date and the default date, or a date prior to the default date (considering any lag period, for example, 6 months, 3 months, and so on), can also be considered as a time-dependent covariate in the model. The time of default and the default indicators in the observation window should also be captured. The survival probabilities for each of the individuals at failure time points can be estimated by using the baseline survival function, the estimates $\hat\beta$ of β obtained by maximizing the partial likelihood, and the macro-economic values at those time points. If a user changes the macro-economic values, the user gets different survival probabilities for those accounts at the default time points. In this way, the Cox PH model can help to see the impact of those variables on the probability of default.
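The link between the baseline survival function and an individual account's survival probability can be written out explicitly. The following is a sketch of the standard derivation in the notation of the Cox PH model above; the symbol $S_0(t)$ for the baseline survival function and the vector notation $\hat\beta^{\top} X_i(u)$ are notational assumptions that the text does not spell out.

```latex
% General case, time-dependent covariates X_i(u):
S_i(t) = \exp\left\{ -\int_0^t \lambda_0(u)\,
         \exp\big(\hat\beta^{\top} X_i(u)\big)\,du \right\}

% Special case, covariates held fixed at x_i over [0, t]:
S_i(t) = \left[ \exp\left\{ -\int_0^t \lambda_0(u)\,du \right\}
         \right]^{\exp(\hat\beta^{\top} x_i)}
       = \big[ S_0(t) \big]^{\exp(\hat\beta^{\top} x_i)}
```

A larger value of $\hat\beta^{\top} x_i$ raises the exponent and therefore lowers $S_i(t)$, which is how a changed macro-economic value feeds through to a changed survival probability.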
2. Survival analysis can be used to determine the probability of default across time in an observation window. It is also possible to derive the survival function of default for future time points. This represents an advantage over the traditional PD-estimation methods, which limit the prediction to a predefined observation window. By generating PDs across time, a risk manager can adjust account treatment strategies according to an account’s risk level at that point in time, rather than rely on a single, long-term PD forecast that might not represent the account’s current risk levels.
Figure 1 shows an example of the probability of default across time in an observation window, as generated by using survival analysis. Figure 2 shows an example of the survival function of default across time in that observation window. Both graphs are based on the same sample data. From Figure 1 and Figure 2, it is evident that as time passes, the probability of default increases, and the corresponding survival function decreases. At the early stages of the performance window, the change in the macro-economic values might not be significant enough to cause default events. However, as time passes, those economic conditions might vary considerably, which might force the accounts to default. Thus, at an early stage of the performance window, the probability of default is low, and the survival probability is high. However, as time passes, due to changes in economic conditions, the probability of default might increase, and the survival probability might decrease.
[Figure 1: probability of default across time in the observation window]
[Figure 2: survival function of default across time in the observation window]
3. Bankers generally build several segmented models for any given portfolio, to generate better forecasts for specific populations in that portfolio. As an extension of the above, a user can further build separate survival models for the various segments in the population to generate insight into how the PDs of the various sub-populations vary across time. Figure 3 illustrates how the PDs of five separate subpopulations within a portfolio vary across time. Again, this gives the risk manager deeper insights into account behavior than one simple PD prediction for the whole performance window. Figure 4 shows how the different survival functions of default for the various segments can be compared.
Figure 3 and Figure 4 are based on the same sample data. Let us assume that the subpopulations are based on the scores of the accounts. Segment A has the lowest-score accounts (that is, the highest-risk-profile accounts). This means that for the accounts in that segment, the probability of default is higher compared to the accounts that belong to higher-score segments. Segment E has the highest-score accounts (that is, the lowest-risk-profile accounts).
Figure 3 shows the trend of PD values for segments across time. In the figure, it can be observed that, for segment A, the distribution of PD values across time is high among all the segments. However, for segment E, the distribution of PD values across time is low. In Figure 4, it can be observed that for segment A, the survival function of default is low among all the segments. However, for segment E, the survival function of default is high.
[Figure 3: PD values across time for segments A to E]
[Figure 4: survival functions of default across time for segments A to E]
4. Using the survival functions generated by the survival models, it is possible to derive the time of default for each account, within all the segments. We can then use these predictions to derive the expected number of defaults for each segment in consecutive future time points. Figure 5 shows an example of such an analysis for five pools across five time points. Those who monitor the performance of portfolios will recognize this chart as being similar to the vintage or cohort analysis used widely in the industry.
Figure 5 is based on sample data. Let us assume that segments are based on the scores of the accounts. Segment A has the lowest-score accounts (that is, the highest-risk-profile accounts). This means that for the accounts in that segment, the probability of default is higher compared to the accounts that belong to higher-score segments. Segment E has the highest-score accounts (that is, the lowest-risk-profile accounts). From Figure 5, it is evident that, for a particular time point, the percentage of default decreases from segment A to segment E.
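The step from per-account survival functions to an expected default count per segment can be sketched as follows. This Python fragment is an illustration only: it assumes that an account's probability of defaulting in period t is S(t−1) − S(t), with S(0) = 1, and the function name and the two survival curves are hypothetical.

```python
def expected_defaults(survival_curves):
    """Expected number of new defaults in each period for one segment.

    survival_curves holds one list per account, giving S_i(t) at t = 1..T.
    The probability that account i defaults in period t is taken to be
    S_i(t-1) - S_i(t), with S_i(0) = 1.
    """
    n_periods = len(survival_curves[0])
    expected = [0.0] * n_periods
    for curve in survival_curves:
        previous = 1.0
        for t, s in enumerate(curve):
            expected[t] += previous - s
            previous = s
    return expected

# Hypothetical survival curves for two accounts in one segment.
segment_a = [[0.90, 0.75, 0.60], [0.95, 0.85, 0.70]]
exp_a = expected_defaults(segment_a)
pct_a = [100.0 * e / len(segment_a) for e in exp_a]  # expected % of defaults
```

Repeating the calculation for each segment gives one series per segment, which is the kind of input a chart such as Figure 5 would need.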
[Figure 5: expected defaults for five segments across five time points]
5. Because macro-economic variables are being used as time-dependent covariates here, the impact of changes in macro-economic conditions on the predicted default rates and the time of default can be recalculated at future time points. Figure 6 shows an example of how the changes in default behavior, which result from changes in five selected macro-economic variables, can be monitored.
Let us assume that five different survival models have been built for five different segments. Figure 6 shows the expected percentage of default across future time points for five segments based on certain macro-economic values.
[Figure 6: expected percentage of default across future time points for five segments]
Now, a user might be interested to see what impact the changes in macro-economic values have on this sort of report. Let us assume that the user has the capability to perform iteration on the macro-economic values for future time points from any time series model. If the updated macro-economic values are significantly different from the previous values, the user might be interested to see their impact on the default behavior (Table 1).
Table 1:
Time    Updated Macro-economic Values
1       <value>
2       <value>
3       <value>
4       <value>
5       <value>
The user can then recalculate the survival function of default for each account within all segments for those future time points, for the updated macro-economic values (Table 1). Then, from the survival functions, it is possible to predict which accounts will default and at what time. It is also possible to generate the same kind of report (Figure 6) that defines the expected percentage of default distribution across time for the five segments. This report (Figure 7) appears different from the earlier one (Figure 6). This sort of analysis can help bankers to see the variation in percentage of default across time (within the sub-populations), with any changes in the macro-economic values.
[Figure 7: expected percentage of default across future time points for five segments, with updated macro-economic values]
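The recalculation under updated macro-economic values can be sketched for a single account and a single covariate. The Python fragment below is an assumption-laden illustration, not the paper's procedure: it treats the covariate as constant within each period and accumulates the baseline cumulative hazard increment times exp(beta × covariate). The increments, the coefficient, and both macro-economic paths are hypothetical numbers.

```python
import math

def survival_path(baseline_hazard_increments, covariate_path, beta):
    """S(t) for one account when the covariate changes from period to period.

    baseline_hazard_increments[j]: rise in baseline cumulative hazard in period j.
    covariate_path[j]: macro-economic value in force during period j.
    """
    cumulative = 0.0
    path = []
    for d_h0, x in zip(baseline_hazard_increments, covariate_path):
        cumulative += d_h0 * math.exp(beta * x)
        path.append(math.exp(-cumulative))
    return path

d_h0 = [0.02, 0.02, 0.03, 0.03, 0.04]    # hypothetical baseline increments
beta = 0.5                                # hypothetical coefficient
original_macro = [0.0, 0.0, 0.0, 0.0, 0.0]
updated_macro = [0.0, 0.5, 1.0, 1.5, 2.0]  # hypothetical stressed path

s_original = survival_path(d_h0, original_macro, beta)
s_updated = survival_path(d_h0, updated_macro, beta)
```

With a positive coefficient, the stressed path gives a survival curve that lies on or below the original one, which corresponds to the shift from Figure 6 to Figure 7 described above.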
Process Flow of Survival Analysis in Credit Scoring
Data requirements
There are no limitations on the number or the nature of macro-economic variables that can be used for this exercise. The exact list will depend on the availability of data in each market and whether there is a business buy-in for using the data. In general, leading indicators such as the following are considered. Although the raw value of each indicator can be used, changes in the indicator over a 3-to-6-month period can be a better predictor of the future, because they incorporate the direction in which that indicator is moving.
• inflation rate
• interest rates and bond yields
• unemployment rate and employment growth rate
• housing price index
• consumer price index
• average personal income growth rate
• stock market index
• housing starts and similar statistics such as building permits issued
• GDP
• money supply
• manufacturing output
• corporate profits and inventory levels
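The idea of using the change in an indicator over a lag period, rather than its raw value, can be sketched in a few lines of Python. The function name and the index values below are hypothetical and serve only as an illustration.

```python
def lagged_change(series, lag):
    """Change in an indicator relative to its value `lag` periods earlier.

    Returns None for the first `lag` periods, where no earlier value exists.
    """
    return [None if i < lag else series[i] - series[i - lag]
            for i in range(len(series))]

# Hypothetical monthly housing price index values.
house_price_index = [100, 101, 103, 106, 110, 112]
change_3m = lagged_change(house_price_index, 3)
```

The sign of each entry in `change_3m` carries the direction in which the indicator is moving, which the raw index level alone does not show.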
In addition to the macro-economic variables above, demographic variables (postal code, residential status, age, and so on) and behavioral variables (account-related variables) can also be considered. If the segmentation was based on these variables, then only the macro-economic variables should be considered for the survival model.
The default time of each defaulted account should be captured in the modeling data. If the account does not default in the outcome period, then the entire outcome period should be considered as the survival time for that account. The default indicator should also be included in the modeling data. Macro-economic variables should be captured at the account default date, prior to the default date (any lag period can be considered), at the account open date, and at regular time intervals during the observation period. The change in macro-economic values between the account open date and the default date, or a date prior to the default date (considering any lag period, for example, 6 months, 3 months, and so on), can also be considered as a time-dependent covariate in the model.
For the survival model, interval censoring can be considered. If the defaults are captured for a specific time period, then the censoring technique can be considered as interval censoring.
Table 2 shows the structure of a sample modeling analytical base table (ABT) for using survival analysis in a credit-scoring environment. Consider a sample data set consisting of the following:
1. ID (account identification number)
2. Time (time of default of the account in a predefined observation period)
3. Default (censoring status, where 1 = default and 0 = censored)
4. cov1 to cov6 (values of a time-dependent covariate captured at six time points in the observation period; covi denotes the covariate value at the ith time, i = 1, 2, ..., 6)
For instance, the first account defaults at time 3. Therefore, the covariate values have been captured up to time 3. The second account defaults at time 4. Therefore, the covariate values have been captured up to time 4. The third account does not default in the outcome period. Therefore, its covariate values can be captured up to time 6. The third account will be considered a censored event.
Table 2:
Account id  time  default  cov1     cov2     cov3     cov4     cov5     cov6
1           3     1        <value>  <value>  <value>
2           4     1        <value>  <value>  <value>  <value>
3           6     0        <value>  <value>  <value>  <value>  <value>  <value>
Study the impact of macro-economic variables on default behavior (using Cox model by PROC PHREG)
Many types of models have been used for survival analysis. This paper focuses on the Cox proportional hazard model. In SAS, the PHREG procedure performs regression analysis of the survival data based on the Cox proportional hazard model.
1. To handle a time-dependent covariate in the PHREG procedure, the counting process style of input is used. The advantage of this format is that the user can easily get the residual and influence statistics. In this format, multiple records are created for each account, one record for each distinct pattern of the time-dependent measurements. Each record contains a T1 value and a T2 value, representing the time interval between T1 and T2 during which the values of the explanatory variables remain unchanged. Each record also contains the censoring status at T2.
Following is an example of a modeling data set that is created in counting process style of input format. PHREG can be run on this type of data set.
Table 3:
Account id  Time  default  T1  T2  status  covariate
1           3     1        0   1   0       <value>
1           3     1        1   2   0       <value>
1           3     1        2   3   1       <value>
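The conversion from the one-row-per-account layout (Table 2) to the counting process layout (Table 3) can be sketched as follows. This Python fragment is an illustration under a simplifying assumption: it creates one record per unit time interval, whereas the text above creates one record per distinct covariate pattern. The function name, dictionary keys, and covariate values are hypothetical.

```python
def to_counting_process(account_id, time, default, covariates):
    """Expand one Table 2 row into Table 3 style rows.

    covariates holds the values observed at times 1..time (cov1, cov2, ...).
    Each output row covers the interval (t1, t2]; status is 1 only on the
    last interval of an account that defaulted.
    """
    rows = []
    for t2 in range(1, time + 1):
        status = 1 if (default == 1 and t2 == time) else 0
        rows.append({"id": account_id, "time": time, "default": default,
                     "t1": t2 - 1, "t2": t2, "status": status,
                     "covariate": covariates[t2 - 1]})
    return rows

# Account 1 defaults at time 3; the covariate values are hypothetical.
rows_1 = to_counting_process(1, 3, 1, [0.2, 0.4, 0.7])
# Account 3 is censored at time 6.
rows_3 = to_counting_process(3, 6, 0, [0.2, 0.4, 0.7, 0.6, 0.5, 0.3])
```

For account 1 this yields the three intervals (0,1], (1,2], and (2,3] with status 0, 0, and 1, matching the rows of Table 3.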
2. Consider that we have a scoring data set, and we want to find the predicted survival probability of default across time for the accounts in that data set.
Table 4:
Account id  time  covariate
1           1     <value>
1           2     <value>
1           3     <value>
1           4     <value>
PROC PHREG can be run on the modeling data set (Table 3) and can generate survival probabilities for the accounts in the scoring data set (Table 4). The following sample SAS code generates the survival probabilities.
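/* Reference covariate pattern: covariate set to zero (baseline). */
data Zerocov;
   covariate = 0;
run;

/* Sort the scoring data by time so that it can be merged with the baseline. */
proc sort data=scoredata;
   by time;
run;

/* Fit the Cox PH model on counting process input and output the baseline survival. */
proc phreg data=inputdata;
   ods output ParameterEstimates=para_est FitStatistics=fit_stat;
   model (T1,T2)*Status(0) = covariate;
   baseline covariates=Zerocov out=Basesurv survival=Basesurvival;
   id ID Time Default;
run;

/* Store the estimated regression coefficient in a macro variable. */
proc sql;
   select Estimate into :Estimate from para_est;
quit;

/* Carry the last baseline survival forward, then scale it by the covariate effect. */
data Scoredata(drop=lastS);
   retain lastS 1;
   merge Scoredata Basesurv(rename=(t2=time) drop=covariate);
   by time;
   if missing(Basesurvival) then Basesurvival = lastS;
   else lastS = Basesurvival;
   survival = Basesurvival ** exp(&Estimate*covariate);
run;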
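The logic of the final DATA step can be mirrored outside SAS as a sanity check. The Python sketch below is not a translation of PROC PHREG; it only reproduces the scoring arithmetic, assuming a single covariate, a baseline survival table keyed by event time, and a coefficient that has already been estimated. The function name and all values are hypothetical.

```python
import math

def score_accounts(score_rows, baseline, beta):
    """Attach a predicted survival probability to each scoring row.

    score_rows: list of dicts with 'id', 'time', and 'covariate'.
    baseline: dict mapping event time -> baseline survival at that time.
    The baseline is a step function, so for a time without an entry the most
    recent earlier value is carried forward (1.0 before the first event).
    """
    event_times = sorted(baseline)
    scored = []
    for row in score_rows:
        s0 = 1.0
        for t in event_times:
            if t <= row["time"]:
                s0 = baseline[t]
            else:
                break
        survival = s0 ** math.exp(beta * row["covariate"])
        scored.append({**row, "survival": survival})
    return scored

baseline = {1: 0.98, 3: 0.90}   # hypothetical baseline survival values
rows = [{"id": 1, "time": t, "covariate": 0.0} for t in (1, 2, 3, 4)]
scored = score_accounts(rows, baseline, beta=0.8)
stressed = score_accounts([{"id": 1, "time": 3, "covariate": 1.0}], baseline, beta=0.8)
```

With the covariate at zero, the scored survival equals the carried-forward baseline (0.98, 0.98, 0.90, 0.90); with a positive covariate and a positive coefficient it is lower.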
This way, it is possible to generate the survival probability for each account across time. The following table (Table 5) shows the structure of the data set that contains the survival probabilities for each account across time:
Table 5:
Account id  time  survival probability
1           1     <value>
1           2     <value>
1           3     <value>
1           4     <value>
1           5     <value>
1           6     <value>
A user can change the covariate values for future time points and can get a different survival probability for each account for those time points. In this way, it is possible to see the impact of macro-economic variables on default behavior.
3. One advantage of using the counting process formulation is that you can easily obtain various residuals and influence statistics. To do this, you can use the following code:
proc phreg data=inputdata;
   model (T1,T2)*Status(0) = covariate;
   output out=dev_est xbeta=est resmart=marting;
run;
In the dev_est data set, the variable “marting” contains martingale residuals that can help the user proceed with residual analysis of the model.
Different Survival Methods
The counting process formulation enables PROC PHREG to fit a superset of the Cox model, which is known as the multiplicative hazards model.
A. The Multiplicative Hazards Model
Consider a set of n accounts so that the counting process $N_i \equiv \{N_i(t), t \ge 0\}$ for the ith account indicates the number of observed events that occur over time t. The sample paths of the counting process $N_i$ are step functions that have jumps of size +1, with $N_i(0) = 0$.
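Let $\beta$ denote the vector of unknown regression coefficients. The multiplicative hazards function $\Lambda(t, Z_i(t))$ for $N_i$ is denoted by
$$Y_i(t)\,d\Lambda(t, Z_i(t)) = Y_i(t)\,\exp(\beta' Z_i(t))\,d\Lambda_0(t)$$
where,
$Y_i(t)$ indicates whether the ith account is at risk of default at time t (specifically, $Y_i(t) = 1$ if the account is at risk, and $Y_i(t) = 0$ otherwise).
$Z_i(t)$ is the vector of explanatory variables for the ith account at time t.
$\Lambda_0(t)$ is an unspecified baseline hazard function.
B. Piece-wise Constant Baseline Hazard Model
In the Piece-wise Constant Baseline Hazard Model, the period is divided into certain intervals. Then, considering a baseline hazard value for each of the intervals, the baseline cumulative hazard function is generated.
Considering the single failure time variable case, let $\{(t_i, x_i, \delta_i),\ i = 1, 2, \dots, n\}$ be the observed data. Let $a_0 = 0 < a_1 < \dots < a_{J-1} < a_J = \infty$ be a partition of the time axis.
When hazards are in original scale, the hazard function for account i is denoted by
$$h(t \mid x_i; \theta) = h_0(t)\,\exp(\beta' x_i)$$
where,
$$h_0(t) = \lambda_j \quad \text{if } a_{j-1} \le t < a_j,\ j = 1, \dots, J$$
The baseline cumulative hazard function is denoted by
$$H_0(t) = \sum_{j=1}^{J} \lambda_j\,\Delta_j(t)$$
where,
$$\Delta_j(t) = \begin{cases} 0 & t < a_{j-1} \\ t - a_{j-1} & a_{j-1} \le t < a_j \\ a_j - a_{j-1} & t \ge a_j \end{cases}$$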
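For a piecewise-constant baseline hazard, the baseline cumulative hazard at time t is the sum, over the intervals, of the interval's hazard rate multiplied by the time spent in that interval up to t. The following Python sketch illustrates that calculation; the function name, the cut points, and the rates are hypothetical, and the rates are assumed to be given rather than estimated.

```python
def baseline_cumulative_hazard(t, cut_points, rates):
    """Cumulative baseline hazard at time t for a piecewise-constant hazard.

    cut_points: interior interval boundaries in increasing order; the first
        interval starts at 0 and the last one is open-ended.
    rates: one constant hazard rate per interval, so
        len(rates) == len(cut_points) + 1.
    """
    edges = [0.0] + list(cut_points) + [float("inf")]
    total = 0.0
    for j, rate in enumerate(rates):
        lower, upper = edges[j], edges[j + 1]
        if t <= lower:
            break  # t lies before this interval starts
        total += rate * (min(t, upper) - lower)  # time spent in interval j
    return total

# Hypothetical partition (months) and hazard rates.
cuts = [6, 12]
lams = [0.01, 0.02, 0.04]
h0_at_9 = baseline_cumulative_hazard(9, cuts, lams)
h0_at_15 = baseline_cumulative_hazard(15, cuts, lams)
```

An account-level survival probability would then follow by multiplying this quantity by the exponentiated linear predictor and taking exp of the negative, as in the Cox PH model.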
For using the counting process style of input, let $\{(t_{i1}, t_{i2}, x_i, \delta_i),\ i = 1, 2, \dots, n\}$ be the observed data. Let $a_0 = 0 < a_1 < \dots < a_{J-1} < a_J = \infty$ be a partition of the time axis, where $a_J > t_{i2}$ for all i. Then, $\Delta_j(t_i)$ should be replaced with
$$\Delta_j((t_{i1}, t_{i2}]) = \begin{cases} 0 & t_{i1} \ge a_j \text{ or } t_{i2} \le a_{j-1} \\ \min(t_{i2}, a_j) - \max(t_{i1}, a_{j-1}) & \text{otherwise} \end{cases}$$
and the formulation for the single failure time variable applies.
The BAYES statement invokes a Bayesian analysis of the Cox model or the piecewise constant baseline hazard model (also known as the piecewise exponential model). The following code can be used.
/* interval stands for the required number of intervals */
proc phreg data=inputdata;
   ods output ParameterEstimates=para_est;
   model (T1,T2)*Status(0) = covariate;
   bayes seed=1 piecewise=hazard (n=interval);
run;
The code specifies the number of intervals with constant baseline hazard rates. PROC PHREG partitions the time axis into the given number of intervals with an approximately equal number of events in each interval.
Algorithm for Predicting the Time of Default
After a model has been created, it can be used to predict defaults. By using the Cox PH model, the survival function of default for an account can be generated.
The time point at which the survival probability is zero can be considered as the time of default. Theoretically, the survival probability can be zero. However, in practice, it might not be possible to get a zero probability of survival. To compensate for this, a user can predefine a threshold value. If the predicted survival probability is less than or equal to the threshold value, then that time might be considered the time of default. The threshold value can be based on business judgment. It can vary from industry to industry.
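The threshold rule can be sketched as a simple scan over an account's predicted survival curve. In this Python illustration, the function name, the survival values, and the threshold of 0.30 are hypothetical; the threshold in practice would come from business judgment, as noted above.

```python
def predicted_default_time(times, survival, threshold):
    """First time point whose predicted survival is at or below the threshold.

    Returns None if the survival probability never reaches the threshold,
    in which case the account is not predicted to default in the window.
    """
    for t, s in zip(times, survival):
        if s <= threshold:
            return t
    return None

# Hypothetical predicted survival curve for one account.
times = [1, 2, 3, 4, 5, 6]
survival = [0.95, 0.80, 0.55, 0.30, 0.12, 0.05]
t_default = predicted_default_time(times, survival, threshold=0.30)
t_none = predicted_default_time(times, survival, threshold=0.01)
```

A lower threshold moves the predicted time of default later, and a threshold that is never reached leaves the account without a predicted default in the window.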
Generally, the survival time of an account is measured from the date when the account was opened. If the account never defaults, then the survival time can be measured from when the account was opened to the end of the observation period. If interval censoring is considered for the survival model, then the survival time can be measured from when the observation period starts to the date of default. If the account does not default, then the survival time would be the entire observation period.
Validation Technique for Predicting Time of Default
1. A standard practice in any predictive modeling project is to validate the outcome of the modeling process. For the validation process, a validation data set from some point in the past can be used. The survival function is applied on this data set to generate scores (predicted time to default) based on a specified threshold value. The actual time of default of an account is known for the development data set. The time of default of an account can be predicted. In an ideal scenario, the predicted time of default for an account should be close to the actual time of default.
[Figure 8: actual and predicted number of defaults across time]
This graph is based on sample data. The actual number of defaults and the predicted number of defaults of accounts are plotted across time. In an ideal case, both the plots should be close to each other.
In addition to a visual confirmation, some statistics can be generated to measure the difference between the actual and the predicted observations.
A measure can be calculated as:
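$$\frac{1}{n}\sum_{i=1}^{n}\left|\,t_i^{\text{actual}} - t_i^{\text{predicted}}\,\right|$$
where n is the total number of accounts in the data, and $t_i^{\text{actual}}$ and $t_i^{\text{predicted}}$ are the actual and the predicted time of default of the ith account.
The absolute difference of the actual and the predicted time of default is considered, so that early and late predictions do not cancel each other out. The predictive accuracy of a time of default model gets better as the above measure gets closer to zero, indicating a closer match between the predicted and actual values.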
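A measure of this kind can be computed directly from paired actual and predicted default times. The Python sketch below assumes the measure is the mean absolute difference over accounts; the function name and the sample times are hypothetical.

```python
def mean_absolute_time_error(actual, predicted):
    """Mean absolute difference between actual and predicted default times."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual and predicted default times for three accounts.
actual_times = [3, 4, 6]
predicted_times = [3, 5, 4]
error = mean_absolute_time_error(actual_times, predicted_times)
```

A value of zero would mean that every predicted time of default matches the actual one.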
Some other statistics (for example, Brier score) can also be generated in this scenario. A non-parametric test, such as median test, can also be performed to compare the median actual time of default with the median predicted time of default.
Acknowledgements
I would like to thank Mrs. Billie Anderson, Mr. Ying So, Mr. David Park, Mr. Rajesh Khanwelkar, and the SAS Credit Scoring for Banking team.
References
Paul D. Allison. Survival Analysis Using SAS: A Practical Guide
SAS/STAT 9.2 User’s Guide
Elisa T. Lee, John Wenyu Wang. Statistical Methods for Survival Data Analysis
John P. Klein, Melvin L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data
Shenyang Guo. Survival Analysis
J. D. Kalbfleisch, Ross L. Prentice. The Statistical Analysis of Failure Time Data
R. G. Miller. 1981. Survival Analysis. New York: Wiley
R. Peto. 1973. “Experimental Survival Curves for Interval-Censored Data.” Applied Statistics 22:86-93
Alan B. Cantor. SAS Survival Analysis Techniques
Jerald F. Lawless. Statistical Models and Methods for Lifetime Data
David W. Hosmer, Stanley Lemeshow. Applied Survival Analysis: Regression Modeling of Time to Event Data (Wiley Series in Probability and Statistics)