Estimation of non-Gaussian Affine Term Structure Models∗
Drew D. Creal†
Chicago Booth

Jing Cynthia Wu‡
Chicago Booth

First draft: September 15, 2012. This draft: September 24, 2013.
Abstract
We develop a new estimation procedure for non-Gaussian affine term structure models that uses linear regression to construct a concentrated likelihood function. Many parameters that often cause problems for conventional estimation methods are eliminated from the likelihood. Our approach consistently finds the maxima, whereas conventional approaches do not. It is more than 60 times faster, dropping estimation time from several hours to a couple of minutes. Our method also works for models with observable macroeconomic variables and hidden factors, and can allow for restrictions on the key parameters of interest.
Keywords: affine term structure models; multivariate non-central gamma distribution; concentrated log-likelihood; estimation; local maxima.
∗We thank Michael Bauer, Alan Bester, John Cochrane, Rob Engle, Jim Hamilton, Chris Hansen, Ken Singleton and seminar and conference participants at Chicago Booth, NYU Stern, NBER Summer Institute, Chicago Booth Junior Finance Symposium, Bank of Canada, Kansas, and UMass for helpful comments. Both authors gratefully acknowledge financial support from the University of Chicago Booth School of Business. Cynthia Wu also gratefully acknowledges financial support from the IBM Faculty Research Fund at the University of Chicago Booth School of Business.
†The University of Chicago Booth School of Business, 5807 South Woodlawn Avenue, Chicago, IL 60637, USA, [email protected]
‡The University of Chicago Booth School of Business, 5807 South Woodlawn Avenue, Chicago, IL 60637, USA, [email protected]
1 Introduction
We develop a family of discrete-time, non-Gaussian affine term structure models (ATSMs)
whose state vector is closed under admissible affine transformations. The market prices of
risk have an intuitive form analogous to popular Gaussian ATSMs. Our main contribution
is a new estimation procedure that uses linear regression to concentrate many parameters
out of the likelihood function. This dramatically reduces the dimension of the parameter
space. Our approach consistently finds the maxima, whereas conventional approaches do not.
Estimation time also drops from hours to minutes. We can include observable macroeconomic
variables or hidden factors and allow for restrictions on the key parameters of interest.
Our method makes it possible for us to study local maxima, explain why they exist, and
examine their economic implications. Estimating several popular models, we find that Gaussian and
non-Gaussian models fit the cross-section of yields equally well. Differences in the economic
implications between the models come primarily from their time series dynamics. Finally,
we explain where the superior cross-sectional information comes from, and demonstrate that
a small number of yields capture a large quantity of information.
ATSMs are popular among policy makers, practitioners, and academic researchers. They
are the workhorse models for pricing bonds, understanding the role of monetary policy,
and determining how macroeconomic shocks impact discount rates; for overviews, see Piazzesi(2010), Duffee(2012), Gurkaynak and Wright(2012), and Diebold and Rudebusch(2013).
As the literature on ATSMs has developed over the last decade, a consensus has emerged that
estimation can be challenging. This point was emphasized by, among others, Duffee(2002),
Ang and Piazzesi(2003), Kim and Orphanides(2005), and Hamilton and Wu(2012b). Direct
maximization of the log-likelihood function can be problematic as yields are close to non-
stationary and the no-arbitrage restrictions place complicated non-linear constraints on the
parameters of the model. ATSMs also have many local maxima that carry different economic implications. Only recently have reliable and transparent estimation methods been
developed for Gaussian ATSMs; see, Joslin, Singleton, and Zhu(2011), Christensen, Diebold,
and Rudebusch(2011), and Hamilton and Wu(2012b).
Although Gaussian ATSMs have provided important insights, they cannot capture conditional heteroskedasticity.1 Introducing positive non-Gaussian state variables serving as
stochastic volatility factors unfortunately further complicates estimation. First, the stochastic volatility factors have more complicated dynamics and are bounded from below by zero.
Second, in models with both types of factors, Gaussian and non-Gaussian factors are asymmetric
because the non-Gaussian state variables enter both the conditional mean and variance of
the Gaussian state variables. The interaction between these factors creates a plethora of
local maxima that are not present in models with only Gaussian factors.
Much of the literature estimating non-Gaussian ATSMs has been conducted in continuous time; see among others Duffie and Kan(1996), Dai and Singleton(2000), Duffee(2002), Cheridito, Filipovic, and Kimmel(2007), Collin-Dufresne, Goldstein, and Jones(2008), and Aït-Sahalia and Kimmel(2010). With the exception of a few special sub-classes of models, the
transition densities of the state variables within a continuous-time model are not known, making it necessary to approximate the likelihood function. Le, Singleton, and Dai(2010)
extended the univariate discrete-time, non-Gaussian model of Gourieroux and Jasiak(2006)
to multiple factors, and provided a class of discrete-time models.
One contribution of this paper is to develop a family of discrete-time non-Gaussian models that encompass any admissible rotation of a multivariate discrete-time Cox, Ingersoll, and Ross(1985) process, nesting other discrete-time models as one of the rotations.2 This
generalization helps to understand identification of the model. We derive the properties of
this class of models and show that all members within this family have a closed-form transition density. As an immediate result, the likelihood function is exact as well. The market
prices of risk are of the extended affine form of Cheridito, Filipovic, and Kimmel(2007). We
1For example, people use Gaussian models to study how macroeconomic fundamentals impact interest rates (Ang and Piazzesi(2003)), the dynamics of term premia across different countries (Wright(2011)), and the impact of quantitative easing by the Federal Reserve when the Fed funds rate is at its zero lower bound (Hamilton and Wu(2012a)).
2In this paper, we do not consider the class of non-Gaussian ATSMs built from the non-central Wishart process of Gourieroux, Jasiak, and Sufana(2009).
demonstrate that they have an intuitive form, illustrating how an agent gets compensated
for facing both Gaussian and stochastic volatility shocks.
Our main contribution is a new estimation procedure based on a concentrated log-
likelihood that dramatically improves estimation. Most of the parameters governing the
conditional mean of the state variables can be concentrated out of the model analytically
using linear regression. This reduces the dimensionality of the optimization problem. More
importantly, these parameters can cause numerical problems as well as extremely slow convergence to the optimum due to the near non-stationarity of interest rates. We also provide
the analytical gradient of the log-likelihood.
Our method outperforms conventional approaches both in terms of stability of convergence and speed. A Monte Carlo study shows that our method guarantees convergence as long as the model is locally identified, and it converges to a number of local maxima repeatedly. Aside
from being able to find the global maximum, our method helps us to locate and understand
the economic implications of different local maxima. Conversely, the conventional method of
directly maximizing the original likelihood never converges fully to any of the local maxima,
nor does it converge to the same point twice in repeated trials even when it is initialized near the same local mode. This makes it difficult for researchers to differentiate between
points near a well-behaved local maximum having the same economic meaning and locations
corresponding to local maxima that are economically different. The median estimation time for our new procedure is less than 2 minutes for a three factor model with one non-Gaussian factor, whereas the conventional approach takes over 2 hours.
Using our method, we shed light on how local maxima with different economic implications are created in non-Gaussian models. In Gaussian models, different rotations of the
factors (such as re-ordering of the factors) result in equivalent global maxima, with identical
economic implications. Researchers impose identifying restrictions to isolate one of these
global maxima. In non-Gaussian models, rotations such as re-ordering factors can have sub-
stantial economic impacts. The non-Gaussian state variables must be positive and enter the
conditional variance. This creates an asymmetry between the Gaussian and non-Gaussian
factors resulting in many local maxima that are not equivalent. Any economic conclusions
drawn from the models must be made with care as they can vary widely for different local
maxima.
We apply our estimation method to several popular models with three and four factors.
We find that Gaussian and non-Gaussian models fit the cross-section of yields almost identically because (i) the bond loading recursions are identical up to Jensen’s inequality; and
(ii) as the measurement errors in the cross-section of yields are small, an efficient estimator (like maximum likelihood) will use the inverse of the variance as the weighting matrix to emphasize the fit of the cross section. The fact that the cross-sectional fit is the same suggests
phasize the fit of the cross section. The fact that the cross-sectional fit is the same suggests
that any differences in economic conclusions between Gaussian and non-Gaussian models
must be driven by the differences in their time series properties. Gaussian models are more
flexible in their conditional mean. Consequently, term premia are more flexible in Gaussian
models, especially over long horizons. Non-Gaussian models can track the broad trend in
volatility of yields but they do not capture the detailed variation that reduced-form models
can capture. In our data set, the volatility factor is the level factor, because in the 1970s,
interest rates were high when volatility was high. Conventional wisdom in finance suggests
that the short rate serves as the volatility factor but we find that estimates of conditional
volatility are more closely related to the long rate.
We also illustrate where the cross-sectional information comes from and show that a
small number of yields in the cross section contain a large amount of information. This is
because the bond loadings are a high-powered polynomial function of the risk-neutral autoregressive coefficients. The bond loadings are sensitive to small changes in these parameters,
which causes the parameters to be estimated more precisely. This contrasts with the popular
view that the superior information comes from the large number of cross-sectional yields at
any time period.
This paper continues as follows. In Section 2, we specify a general class of discrete-time,
non-Gaussian affine term structure models. In Section 3, we describe our new approach to
estimation, showing how to construct the concentrated likelihood. Section 4 describes the
data and identification of the models. Section 5 studies a three factor non-Gaussian model
in depth. In Section 6, we study several three and four factor Gaussian and non-Gaussian
models. In Section 7, we discuss directions for future research and conclude.
2 Model
In this section, we describe a class of discrete-time, non-Gaussian ATSMs and show that the
state vector is closed under affine transformations. In addition, all models within this family
have closed-form transition densities.
2.1 Factor dynamics
The H×1 vector of non-Gaussian state variables ht+1 captures the volatility. Its stochastic process is obtained by applying an affine transformation to the exact discrete-time equivalent of a multivariate Cox, Ingersoll, and Ross(1985) process. The model for ht+1 under the
physical measure P is
ht+1 = µh + Σhwt+1    (1)

wi,t+1 ∼ Gamma (νh,i + zi,t+1, 1),    i = 1, . . . , H    (2)

zi,t+1 ∼ Poisson (e′iΣ−1h ΦhΣhwt),    i = 1, . . . , H    (3)
where νh = (νh,1, . . . , νh,H) are shape parameters, Φh is a matrix controlling the autocorrelation of ht+1, Σh is a scale matrix, and µh is a vector determining the lower bound of ht+1. We
use ei to denote the i-th column of IH . To guarantee that ht+1 remains positive, all elements
of µh and Σh must be non-negative. To ensure that the mean of the Poisson distribution is
non-negative, Σ−1h ΦhΣh must be non-negative.
The conditional mean of ht+1 can be written in matrix form as
E (ht+1|It) = (IH − Φh)µh + Σhνh + Φhht
where It stands for the agent’s information set at time t. It is a linear function of its own lag
ht, similar to a vector autoregression. The conditional variance V (ht+1|It) is also an affine
function of ht
Σh,tΣ′h,t = Σh diag (νh − 2Σ−1h Φhµh) Σ′h + Σh diag (2Σ−1h Φhht) Σ′h.
Gourieroux and Jasiak(2006) built the univariate version of this model and Le, Singleton,
and Dai(2010) extended it to (1)-(3) with µh = 0 and Σh diagonal. In this model for ht+1,
shocks are allowed to be correlated through the off-diagonal elements of Σh and the process
may have a negatively correlated drift through the off-diagonal elements of Φh. In Appendix
A.2, we provide the transition density of ht+1 for any admissible rotation.
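The gamma-Poisson mixture in (1)-(3) is straightforward to simulate directly. The sketch below is our minimal illustration, not the authors' code; the parameter values in the usage note are made up for the example.

```python
import numpy as np

def simulate_h(T, mu_h, Sigma_h, Phi_h, nu_h, h0, seed=0):
    """Simulate the non-Gaussian volatility factors of equations (1)-(3):
    draw a Poisson mixing variable z, then a unit-scale gamma w with shape
    nu_h + z, and map back to h via the affine transformation (1)."""
    rng = np.random.default_rng(seed)
    Sigma_inv = np.linalg.inv(Sigma_h)
    Phi_tilde = Sigma_inv @ Phi_h @ Sigma_h      # must be elementwise non-negative
    assert np.all(Phi_tilde >= 0), "admissibility: Sigma_h^{-1} Phi_h Sigma_h >= 0"
    h = np.empty((T, len(nu_h)))
    w = Sigma_inv @ (h0 - mu_h)                  # w_t = Sigma_h^{-1} (h_t - mu_h)
    for t in range(T):
        z = rng.poisson(Phi_tilde @ w)           # eq. (3): Poisson intensities
        w = rng.gamma(nu_h + z, 1.0)             # eq. (2): gamma, unit scale
        h[t] = mu_h + Sigma_h @ w                # eq. (1): affine transformation
    return h
```

With one factor and (illustrative values) µh = 0, Σh = 0.01, Φh = 0.9, νh = 2, every draw respects the lower bound µh and the sample mean of a long simulation is close to the stationary mean Σhνh/(1 − Φh) = 0.2, consistent with the conditional mean formula above.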
The G × 1 vector of conditionally Gaussian state variables gt+1 follows a vector autore-
gression with conditional heteroskedasticity
gt+1 = µg + Φggt + Φghht + Σghεh,t+1 + εg,t+1,    εg,t+1 ∼ N(0, Σg,tΣ′g,t),    (4)

Σg,tΣ′g,t = Σ0,gΣ′0,g + ∑Hi=1 Σi,gΣ′i,g hit,

εh,t+1 = ht+1 − E (ht+1|It)
where Σi,g are lower triangular for i = 0, . . . , H. The Gaussian factors are functions of the
non-Gaussian state variables through both the autoregressive term Φghht and the covariance
term Σghεh,t+1. The conditional variance of gt+1 is also a function of the non-Gaussian
factors, which introduces conditional heteroskedasticity into bond prices.
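The way ht enters both the conditional mean and the conditional variance of gt+1 can be sketched in simulation. The function below is our illustration of equation (4) under simplifying assumptions; the path of h and its shocks εh are taken as given.

```python
import numpy as np

def simulate_g(h, eps_h, mu_g, Phi_g, Phi_gh, Sigma_gh, Sigma_0g, Sigma_ig, g0, seed=1):
    """Simulate the conditionally Gaussian factors of equation (4).
    h[t] is the volatility state at date t, eps_h[t] = h[t+1] - E(h[t+1]|I_t),
    and Sigma_ig is a list of the H scale matrices Sigma_{i,g}."""
    rng = np.random.default_rng(seed)
    Tm1, H = eps_h.shape
    g = np.empty((Tm1 + 1, len(mu_g)))
    g[0] = g0
    for t in range(Tm1):
        # affine conditional variance: S_0 S_0' + sum_i S_i S_i' h_{i,t}
        cov = Sigma_0g @ Sigma_0g.T
        for i in range(H):
            cov = cov + Sigma_ig[i] @ Sigma_ig[i].T * h[t, i]
        # h_t also shifts the conditional mean through Phi_gh h_t and Sigma_gh eps_h
        mean = mu_g + Phi_g @ g[t] + Phi_gh @ h[t] + Sigma_gh @ eps_h[t]
        g[t + 1] = rng.multivariate_normal(mean, cov)
    return g
```

Setting all scale matrices to zero shuts off the shocks, and the path converges deterministically toward the conditional mean implied by µg and Φg, which is a convenient sanity check.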
A nice property of the model (1)-(4) for the vector xt+1 = (h′t+1, g′t+1)′ pricing bonds is that any admissible affine transformation remains within the same family of distributions.
Proposition 1 Let xt = (h′t, g′t)′ follow the process of (1)-(4) with parameters θ. Consider an admissible affine transformation of the form

[ h̄t ]   [ ch ]   [ Chh  Chg ] [ ht ]
[ ḡt ] = [ cg ] + [ Cgh  Cgg ] [ gt ] .

The new process x̄t = (h̄′t, ḡ′t)′ remains in the same family of distributions under updated parameters θ̄. The parameters νh and Σ−1h ΦhΣh are invariant to rotation. The admissibility restrictions and the relationship between the new and old parameterizations can be found in Appendix E.1.

Proof: See Appendix E.1.
This proposition helps to understand identification in Section 4.2. The admissibility constraints ensure that the non-Gaussian state variables always remain positive after applying a transformation from xt to x̄t and that there exists another admissible rotation from x̄t back to xt.
Analogous to the popular class of Gaussian ATSMs, we specify xt = (g′t, h′t)′ to have the
same form of dynamics under both P and Q. We allow the parameters controlling the conditional
mean to be different under the two probability measures and set the scale parameters Σgh,Σh
and Σi,g for i = 0, . . . , H to be the same. The location parameter µh must also be the same
under both measures to ensure no-arbitrage.
2.2 Stochastic discount factor
In this section, we demonstrate how an agent gets compensated for risk exposure when
holding a zero-coupon bond under stochastic volatility. The detailed derivation of the log
of the stochastic discount factor (SDF) can be found in Appendix B. We decompose the
log-SDF into the risk free rate plus three components describing the risk compensation
mt+1 = −rt − (1/2)λ′gtλgt − λ′gtεg,t+1 − λ′wtεw,t+1 − λ′ztεz,t+1
where εi,t+1 are standardized shocks with mean zero and identity covariance matrix, and
λit is the price of risk i for each of the three types of shocks in the model. In addition to
the risk-free rate rt, the agent gets compensated for being exposed to the Gaussian shock εg,t+1
in equation (4), the gamma shock εw,t+1 in equation (2), and the Poisson shock εz,t+1 in
equation (3). The prices of these risks are defined as
λgt = V (gt+1|It, ht+1, zt+1)−1/2 [E (gt+1|It, ht+1, zt+1) − EQ (gt+1|It, ht+1, zt+1)],
λwt = V (wt+1|It, zt+1)−1/2 [E (wt+1|It, zt+1) − EQ (wt+1|It, zt+1)],
λzt = V (zt+1|It)−1/2 [E (zt+1|It) − EQ (zt+1|It)].
The market prices of risk have an intuitive form as the Sharpe ratio measuring per unit
risk compensation. Specifically, they are the difference in the conditional means of each
shock under P and Q standardized by a conditional standard deviation. The time-varying
quantities of risk are a feature of non-Gaussian models that are not available in Gaussian
models.
2.3 Bond prices
The price of a zero-coupon bond with maturity n at time t is the expected price of the same
asset at time t+ 1 discounted by the short rate rt under the risk neutral measure
Pnt = EQt [exp (−rt) Pn−1,t+1].
The short rate is a linear function of the state vector
rt = δ0 + δ′1,hht + δ′1,ggt.
This leads to bond prices that are an exponentially affine function of the state variables
Pnt = exp (an + b′n,hht + b′n,ggt).
The bond loadings an, bn,h and bn,g can be expressed recursively in matrix notation as

an = −δ0 + an−1 + µQ′g bn−1,g + [µh − ΦQh µh + ΣhνQh]′ bn−1,h + (1/2) b′n−1,g Σ0,gΣ′0,g bn−1,g
     + µ′hΦQ′h Σ−1′h (IH − [diag (ιH − Σ′hbn−1,gh)]−1) Σ′hbn−1,gh
     − νQ′h [log (ιH − Σ′hbn−1,gh) + Σ′hbn−1,gh]    (5)

bn,h = −δ1,h + ΦQ′ghbn−1,g + ΦQ′h bn−1,h + (1/2) (IH ⊗ b′n−1,g) ΣgΣ′g (ιH ⊗ bn−1,g)
     − ΦQ′h Σ−1′h (IH − [diag (ιH − Σ′hbn−1,gh)]−1) Σ′hbn−1,gh    (6)

bn,g = −δ1,g + ΦQ′g bn−1,g    (7)

where ΣgΣ′g is a GH × GH block diagonal matrix with diagonal elements Σi,gΣ′i,g for i = 1, . . . , H and bn−1,gh = Σ′ghbn−1,g + bn−1,h. The loadings must satisfy the restriction that the i-th component of Σ′hbn−1,gh is less than 1 for i = 1, . . . , H. The initial loadings at maturity one are a1 = −δ0, b1,g = −δ1,g and b1,h = −δ1,h. The derivation of these expressions is available in Appendix C.
Bond yields ynt ≡ −(1/n) log (Pnt) are linear in the factors

ynt = ān + b̄′n,hht + b̄′n,ggt

with ān = −(1/n)an, b̄n,h = −(1/n)bn,h and b̄n,g = −(1/n)bn,g. Stacking ynt in order for N different maturities n1, n2, . . . , nN gives Yt = A + Bxt where A = (ān1, . . . , ānN)′, B = (b̄′n1, . . . , b̄′nN)′.
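To make the recursion concrete, here is a sketch of the Gaussian special case of (5)-(7), i.e. H = 0, so every h-related term drops out. This is our illustration, not the authors' code, and the parameter values in the usage note are purely for the example.

```python
import numpy as np

def gaussian_bond_loadings(N, delta0, delta1, mu_gQ, Phi_gQ, Sigma_0g):
    """Recursion (5)-(7) with H = 0: only the Gaussian terms survive.
    Returns the yield loadings (a_bar, b_bar) with y_t^n = a_bar_n + b_bar_n' g_t."""
    G = len(delta1)
    a = np.zeros(N + 1)
    b = np.zeros((N + 1, G))
    a[1], b[1] = -delta0, -delta1                 # initial loadings at maturity one
    cov = Sigma_0g @ Sigma_0g.T
    for n in range(2, N + 1):
        # eq. (5), Gaussian terms only: intercept, drift, Jensen's inequality term
        a[n] = -delta0 + a[n - 1] + mu_gQ @ b[n - 1] + 0.5 * b[n - 1] @ cov @ b[n - 1]
        b[n] = -delta1 + Phi_gQ.T @ b[n - 1]      # eq. (7)
    ns = np.arange(1, N + 1)
    return -a[1:] / ns, -b[1:] / ns[:, None]      # per-maturity yield loadings
```

For a single factor with δ0 = 0.01, δ1 = 1, ΦQg = 0.5 and no shocks, the one-period yield loads one-for-one on the factor and an n-period yield loads by (1 + ΦQg + . . . + (ΦQg)^{n−1})/n, illustrating how the loadings are polynomial functions of the risk-neutral autoregressive coefficients.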
If more yields are observed than the number of factors (N > G + H), not all yields can be
priced exactly. We make the standard assumption in the ATSM literature that N1 = G+H
linear combinations of the yields Y(1)t = SY1Yt are priced without error and the remaining
N2 = N − N1 linear combinations Y(2)t = SY2Yt are observed with Gaussian measurement
errors. Given this assumption, the observation equations are
Y(1)t = A1 + B1xt,    Y(2)t = A2 + B2xt + ηt,    ηt ∼ N (0,Ω)    (8)
where A1 ≡ SY1A,A2 ≡ SY2A,B1 ≡ SY1B, and B2 ≡ SY2B.
3 Estimation methodology
In this section, we introduce a new estimation method for any identified model. In Section
3.2, we illustrate how the basic idea can be applied to a wide range of ATSMs, including models with observable macroeconomic variables and hidden factors.
3.1 Concentrated likelihood estimation
Given the parameters of the model θ, the likelihood function is
p(Y(1)1:T, Y(2)1:T; θ) = |det (J)|−(T−1) ∏Tt=1 p(Y(2)t|It; θ) · ∏Tt=1 p (gt|ht, It−1; θ) p (g0|h0; θ) · ∏Tt=1 ∏Hi=1 p (hit|It−1; θ) p (h0; θ)    (9)
where J is the Jacobian of the transformation from xt = (g′t, h′t)′ to Y(1)t. We have used the fact that the pricing equation (8) can be inverted

xt = B−11 (Y(1)t − A1),    gt = Sgxt,    ht = Shxt    (10)
and the Gaussian and non-Gaussian factors can be selected out by the matrices Sg and Sh.
Expressions for the log-likelihood ℓ (θ) are available in Appendix D.3 Direct maximization of the log-likelihood is, however, extremely challenging as interest rates are close to non-stationary, the bond loadings are non-linear functions of the models’ parameters, and the maximization must impose the condition that ht > 0.
The key insight to improving estimation methods for this class of models is to recognize
that it is possible to concentrate out a large subset of the parameters from the log-likelihood
by linear regression. Specifically, it is possible to concentrate out the parameters entering
the conditional mean of the Gaussian state variables (µg,Φg,Φgh) as well as the covariance
matrix Ω in (8). The parameter vector can be split into two sub-vectors θ = (θc, θm): the parameters that can be concentrated out, θc = (µg,Φg,Φgh,Ω), and the remaining parameters θm that will be maximized numerically. The method we propose is a result of
the following proposition.
Proposition 2 Let θc = (µg,Φg,Φgh,Ω) and θm contain the remaining parameters of the model. For the general affine model, the maximum likelihood estimator θ̂ = (θ̂c, θ̂m) can be obtained by the following procedure.

(1.) Given θm, maximize the conditional log-likelihood to obtain θ̂c (θm) = argmaxθc ℓ (θc, θm). The first-order conditions for this problem can be solved analytically as follows.

(a.) Given θm, calculate the bond loadings A and B and the state variables gt and ht from xt = B−11 (Y(1)t − A1).4
3The stationary distribution p (g0, h0; θ) = p (g0|h0; θ) p (h0; θ) is only known for special sub-classes of the affine family of models. In this paper, we will assume a diffuse initial condition and start from t = 2. When the stationary distribution of the state vector is known, including the initial condition is easy to do after initial estimation of the model. While including the initial conditions enforces stationarity, it also has a potential negative impact on the estimates of the autoregressive parameters as it can increase their downward bias; see, e.g. Bauer, Rudebusch, and Wu(2012).
4During estimation, we impose B1 to be invertible and ht to be positive.
(b.) Given gt and ht, calculate εh,t+1 and Σg,t. Run a GLS regression

gt+1 − Σghεh,t+1 = µg + Φggt + Φghht + Σg,tεg,t+1

to calculate µ̂g (θm), Φ̂g (θm), Φ̂gh (θm).

(c.) Calculate the covariance matrix

Ω̂ (θm) = (1/(T − 1)) ∑Tt=2 (Y(2)t − A2 − B2xt)(Y(2)t − A2 − B2xt)′

(2.) Substitute θ̂c (θm) = (µ̂g (θm), Φ̂g (θm), Φ̂gh (θm), Ω̂ (θm)) into the original log-likelihood. Maximize the concentrated log-likelihood function θ̂m = argmaxθm ℓ(θ̂c (θm), θm).
Proof: See Appendix E.2.
The dimension of the optimization problem decreases dramatically as the concentrated log-likelihood function ℓ(θ̂c (θm), θm) is only a function of θm. Restrictions on Ω are also possible.
The intuition behind this result is that given θm the bond loadings can be calculated and
the factors gt and ht are conditionally observable from (10). Once the factors are observed,
the first-order conditions for the parameters (µg,Φg,Φgh) can be solved analytically in part
(1.b.) because they enter the log-likelihood only in the P dynamics as quadratic functions of
the state variables. Solving the subset of first order conditions for these parameters in terms
of gt and ht is equivalent to running the generalized least squares (GLS) regression defined
by the conditionally Gaussian factor dynamics.
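The GLS step can be sketched as follows, assuming the factors have already been recovered via (10). This is our illustration of the mechanics of step (1.b.), not the authors' code, and it omits the Ω step; the normal equations for the coefficient matrix are stacked with a Kronecker product.

```python
import numpy as np

def concentrate_theta_c(g, h, eps_h, Sigma_gh, Sigma_0g, Sigma_ig):
    """Solve the GLS first-order conditions for (mu_g, Phi_g, Phi_gh)
    analytically, weighting each date t by the inverse of the affine
    conditional variance Sigma_{g,t} Sigma_{g,t}'."""
    T, G = g.shape
    H = h.shape[1]
    K = 1 + G + H                                      # regressors: [1, g_t', h_t']
    lhs = np.zeros((G * K, G * K))
    rhs = np.zeros(G * K)
    for t in range(T - 1):
        x = np.concatenate(([1.0], g[t], h[t]))
        cov = Sigma_0g @ Sigma_0g.T
        for i in range(H):
            cov = cov + Sigma_ig[i] @ Sigma_ig[i].T * h[t, i]
        W = np.linalg.inv(cov)                         # GLS weight matrix
        y = g[t + 1] - Sigma_gh @ eps_h[t]             # subtract the h-shock term
        lhs += np.kron(np.outer(x, x), W)              # stacked normal equations
        rhs += np.kron(x, W @ y)
    coef = np.linalg.solve(lhs, rhs).reshape(K, G).T   # G x K coefficient matrix
    return coef[:, 0], coef[:, 1:1 + G], coef[:, 1 + G:]   # mu_g, Phi_g, Phi_gh
```

On noiseless data generated from known coefficients, the regression recovers them exactly, which is a useful check that the stacking and the vec ordering are correct.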
Of critical importance is the fact that the parameters being concentrated out, (µg,Φg,Φgh), have
the potential to cause problems during estimation. These P parameters govern the time
series dynamics of the state variables. As yields are close to non-stationary, some factors are
also close to non-stationary.
The dimension of θc depends on the number of factors G and H as well as the rotation
of the state vector xt chosen by the researcher. This is because the number of estimable
parameters entering the matrices (µg,Φg,Φgh) depends on the rotation. Making these full
matrices maximizes the number of parameters that can be concentrated out. There are
multiple rotations of xt that accommodate this. If one considers a rotation of the state
vector such that these are not full matrices, this sub-set of the parameters can still be
concentrated out.
In Appendix F, we also derive the analytical gradients of the concentrated log-likelihood.
Our derivation shows how the gradients for affine models can be decomposed into pieces
according to whether a parameter enters the bond loadings, the P dynamics, or both.
Proposition 3 The gradient of the concentrated log-likelihood ℓ(θ̂c (θm), θm) can be decomposed into three terms:

dℓ(θ̂c (θm), θm, A (θm), B (θm))/dθ′m = ∂ℓ(θ̂c, θm, A, B)/∂θ′m + [∂ℓ(θ̂c, θm, A, B)/∂A′] [∂A (θm)/∂θ′m] + [∂ℓ(θ̂c, θm, A, B)/∂vec (B′)′] [∂vec (B (θm)′)/∂θ′m].
The first term is the partial derivative of the P dynamics and Jacobian with respect to θm.
This measures the direct effect parameters have on the log-likelihood through the time series
of the factors. The second and third terms measure the indirect effect parameters have on
the log-likelihood through the bond loadings A and B.
Proof: See Appendix F.2
The expressions for the gradient can be used for other affine models such as models for
defaultable bonds and credit default swaps. Standard errors and other model diagnostics
also benefit from the analytical gradient.
3.1.1 Approximately concentrating out other parameters
It is possible to “approximately” concentrate out other parameters that either govern the P dynamics of ht, such as Φh, or the scale parameters, such as Σgh. All our final results are based on the
exact concentrated likelihood. But, for complicated models, this reduces the dimensionality
of the maximization problem and it provides excellent starting values for the full estimation.
The matrix Φh only enters the likelihood through the P dynamics. Using the results in
Appendix H, the non-Gaussian state variables can be written as a VAR(1) with conditionally
heteroskedastic, non-Gaussian shocks. This only requires adding step (d.) to Proposition 2.
(d.) Given ht, calculate Φ̂h from the GLS regression

ht+1 − h̄ = Φhht + √(diag (2Σ2h,ii hi,t)) εh,t+1,    E [εh,t+1] = 0,   V [εh,t+1] = IH

where h̄ is an intercept. This step effectively utilizes a QML-type approximation to the dynamics of ht+1|ht by assuming that the non-Gaussian errors are Gaussian.
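For a single volatility factor, step (d.) reduces to one weighted least squares regression. The sketch below is our illustration, assuming the QML conditional variance is proportional to the lagged state, 2σ²h ht.

```python
import numpy as np

def approx_Phi_h(h, sigma_h):
    """Step (d.) for H = 1: weighted least squares for (h_bar, Phi_h), using
    the QML approximation Var(h_{t+1} | h_t) = 2 * sigma_h**2 * h_t."""
    x, y = h[:-1], h[1:]
    w = 1.0 / (2.0 * sigma_h ** 2 * x)           # inverse conditional variances
    X = np.column_stack([np.ones_like(x), x])    # intercept h_bar and slope Phi_h
    coef = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
    return coef[1]                               # the Phi_h estimate
```

Because the weights depend only on the lagged state, this stays a one-line linear algebra step, which is what makes it cheap to use for starting values.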
The parameter Σgh enters the log-likelihood through both the bond loadings A and B
and the dynamics of gt+1. However, it enters the bond loadings only through the Jensen’s
inequality terms whose impact on the bond loadings is small unless very long maturities
are added. Information on the parameter Σgh accumulates primarily from the dynamics of
gt+1. The approximate concentrated log-likelihood can be calculated by evaluating the bond
loadings in part (1.a.) with Σgh = 0 and replacing the regression in part (1.b.) with
(b.’) Given the history of gt and ht, calculate εh,t+1 and Σg,t. Run the GLS regression
gt+1 = µg + Φggt + Φghht + Σghεh,t+1 + Σg,tεg,t+1
to calculate µ̂g (θm), Φ̂g (θm), Φ̂gh (θm), Σ̂gh (θm).
This step is useful for models with multiple non-Gaussian factors.
3.2 Examples
In this section, we discuss how our approach can be applied to several prominent ATSMs.
Example #1: observable macroeconomic variables
Our approach can be used in models with both yield factors and observable macroeco-
nomic variables. Our procedure works the same as before except the state vector xt now
contains the yield factors as well as the observed macroeconomic factors. For step (1.a.) of
Proposition 2, we use Y(1)t to back out the latent component of xt. Shocks to macroeconomic
variables may depend on the latent non-Gaussian factors and be heteroskedastic. A description of the model with macroeconomic variables is provided in Appendix G.2. A special case
of this model with no non-Gaussian factors and no feedback from the latent factors to the
macroeconomic variables is Ang and Piazzesi(2003).
More parameters are identified in these models as long as sufficiently many additional
yields in Y(2)t are available; see, e.g. Hamilton and Wu(2012b) for a discussion of this for
Gaussian ATSMs. Fortunately, many of the parameters that have been added to the model
can be concentrated out of the likelihood function. The parameters concentrated out govern
the time series dynamics of the observed macroeconomic variables, which are often highly
persistent and cause problems during estimation.
Example #2: Gaussian models
Multi-factor Gaussian models are one of the most widely applied tools for conducting
monetary policy and are implemented at central banks around the world. Our approach for
Gaussian models is particularly simple and in our experience these models can be reliably
estimated in a few seconds for a range of rotations with and without restrictions.
A full description of the Gaussian model is provided in Appendix G.3. The parameters in each of the sub-vectors are θc = (µg,Φg,Ω) and θm = (δ0, δ1g, µQg, ΦQg, Σ0,g). In models with only Gaussian factors, the regression in part (1.b.) of Proposition 2 simplifies to OLS
(1.b.) Given gt, calculate µ̂g (θm), Φ̂g (θm) by running the OLS regression

gt+1 = µg + Φggt + Σ0,gεg,t+1
All other steps of the estimation procedure are the same as before.
Example #3: parameter constraints
Researchers often impose restrictions on parameters of an ATSM. Constraints of economic
interest typically center on the relationship between the conditional means of the state vector xt across the P and Q measures; see Cochrane and Piazzesi(2008) and Bauer(2011). This
allows information from the cross-section of yields to be exploited to estimate the time series
parameters. Constraints can also eliminate parameters that are statistically insignificant;
see, e.g. Ang and Piazzesi(2003) and Kim and Wright(2005). In our approach, a researcher
can impose constraints directly on the P and Q parameters within a non-Gaussian model
and still concentrate out parameters by linear regression.
We denote the penalized or constrained log-likelihood function ℓp (θ) as

ℓp(θ) = log p(Y(1)1:T, Y(2)1:T; θ) + p(θ),
where p(θ) is the penalty term. For example, when the constraints can be written as linear
functions of the ATSM’s parameters such as λi = µg,i − µQg,i = 0 and Λij = Φg,ij − ΦQg,ij = 0,
the penalty term is just a vector of Lagrange multipliers. Concentrating parameters out of
the log-likelihood is equivalent to running a constrained GLS regression. Another attractive
approach for incorporating prior information about the dynamics of the factors is to apply
a shrinkage estimator such as ridge regression to µg,Φg,Φgh, in which case the penalty
term p(θ) is a quadratic function of µg,Φg,Φgh. For example, a researcher may want to
shrink the parameters governing the dynamics of the factors under P toward a unit root in
order to counteract the small-sample downward bias of the autoregressive coefficients; see Bauer,
Rudebusch, and Wu(2012). Alternatively, a researcher can shrink the P parameters toward
their counterparts under the Q measure µQg, ΦQg, ΦQgh.
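When the penalty p(θ) is quadratic, the concentration step stays in closed form. Below is a sketch of the ridge idea; `ridge_toward_target` and its arguments are our illustrative notation, not part of the paper.

```python
import numpy as np

def ridge_toward_target(X, Y, B0, kappa):
    """Ridge regression centered at a nonzero target B0 (e.g. the Q-measure
    counterparts of the P coefficients, or a unit root):
        min_B ||Y - X B||_F^2 + kappa * ||B - B0||_F^2
    with closed form (X'X + kappa I)^{-1} (X'Y + kappa B0)."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + kappa * np.eye(K), X.T @ Y + kappa * B0)
```

Setting kappa = 0 recovers OLS, and as kappa grows the estimate is pulled toward B0, which is exactly how a researcher can shrink the P dynamics toward their Q counterparts or toward a unit root.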
Example #4: “hidden” factors
It is well-known since Litterman and Scheinkman(1991) that three factors explain the
majority of variation in the cross-section of yields. Recently, Duffee(2011) argued that more
than three factors are needed to explain the time-series dynamics of yields and risk premia.
These additional factors are “hidden” from the cross-section of yields because the factors
are not priced. The hidden factors are nevertheless part of the P dynamics. For simplicity,
we illustrate the basic ideas here for Gaussian models as in Duffee(2011) and leave the details of the general non-Gaussian model to Appendix G.4.
The Gaussian state vector can be separated into sub-vectors xt+1 = (g′1,t+1, g′2,t+1)′ whose dimensions are G1 × 1 and G2 × 1, respectively. The dynamics under the P measure are

g1,t+1 = µg,1 + Φg,11g1t + Φg,12g2t + ε1,t+1,    ε1,t+1 ∼ N(0, Σ0,gΣ′0,g)    (11)

g2,t+1 = µg,2 + Φg,21g1t + Φg,22g2t + ε2,t+1,    ε2,t+1 ∼ N (0, IG2)    (12)
The dynamics of gt+1 are the same under the Q measure but with the restrictions that
ΦQg,12 = 0 and the last G2 entries of δ1g are zero. These restrictions imply that only g1,t
directly impacts yields as the bond loadings on g2,t are zero by construction.
Given the subset of parameters that enter the bond loadings, the factors that price bonds are conditionally observable through the transformation g1,t = B1^{−1}(Y(1)t − A1), just as in step (a.) of Proposition 2. We can now treat g1,t+1 as the observed data and (11) is the new
observation equation for a linear, Gaussian state space model. The remaining state variables
g2t have transition equation (12) and are just serially correlated shocks to the factors g1t
that price bonds. In our procedure, step (1.b.) of Proposition 2 is replaced by the Kalman
filter, which is equivalent to a GLS regression where the errors are serially correlated. To
concentrate the parameters (µg,1, µg,2,Φg,11,Φg,21) out of the likelihood, we can either place
them in the state vector or use the augmented Kalman filter of de Jong(1991); see also Chapter 5 of Durbin and Koopman(2012). The Kalman filter delivers the concentrated log-likelihood ∑_{t=2}^{T} log p(g1t | g1,t−1; θm), associated with the P dynamics of the model.
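For intuition, here is a minimal sketch of the prediction-error-decomposition log-likelihood that the Kalman filter delivers for a generic linear Gaussian state space (the toy model, dimensions, and parameter values below are our own illustration, not the paper's model):

```python
import numpy as np

def kalman_loglik(y, c, Z, H, d, T, Q, a0, P0):
    """Gaussian log-likelihood via the prediction-error decomposition for
       y_t = c + Z a_t + eps_t,    eps_t ~ N(0, H),
       a_{t+1} = d + T a_t + eta_t, eta_t ~ N(0, Q),  a_1 ~ N(a0, P0)."""
    a, P, ll = a0.copy(), P0.copy(), 0.0
    for yt in y:
        v = yt - c - Z @ a                    # one-step prediction error
        F = Z @ P @ Z.T + H                   # its variance
        ll += -0.5 * (len(yt) * np.log(2 * np.pi)
                      + np.log(np.linalg.det(F)) + v @ np.linalg.solve(F, v))
        K = T @ P @ Z.T @ np.linalg.inv(F)    # Kalman gain (prediction form)
        a = d + T @ a + K @ v
        P = T @ P @ T.T - K @ F @ K.T + Q
    return ll

# Toy example: one observed series driven by one hidden AR(1) factor.
rng = np.random.default_rng(2)
n, phi = 300, 0.8
alpha = np.zeros(n)
for t in range(1, n):
    alpha[t] = phi * alpha[t - 1] + rng.standard_normal()
obs = alpha + 0.5 * rng.standard_normal(n)

y = obs.reshape(-1, 1)
c = d = np.zeros(1)
Z = np.eye(1); H = np.array([[0.25]]); Tm = np.array([[phi]]); Q = np.eye(1)
a0 = np.zeros(1); P0 = np.array([[1.0 / (1 - phi ** 2)]])
ll_true = kalman_loglik(y, c, Z, H, d, Tm, Q, a0, P0)
ll_bad = kalman_loglik(y, c, Z, np.array([[25.0]]), d, Tm, Q, a0, P0)
print("loglik at true params:", ll_true, " at inflated noise:", ll_bad)
```

In Example #4 the observation equation would be (11) with g1,t+1 as the data and g2t as the state; the augmented filter that also concentrates out (µg,1, µg,2, Φg,11, Φg,21) is omitted from this sketch.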
Example #5: observable yield factors
Another special case is models in which the state variables xt are chosen a priori by the researcher and are therefore observable. In most applications, the factors are linear combinations of yields priced without error, xt = Y(1)t. This creates restrictions on the Q parameters within the bond loadings because, by construction, A1 = SY1 A = 0 and B1 = SY1 B = I; see, e.g., Joslin, Singleton, and Zhu(2011) and Hamilton and Wu(2012c). For
Gaussian models with observable factors, our procedure will coincide with Joslin, Singleton,
and Zhu(2011).
Working with observable state variables in non-Gaussian models has some practical difficulties that are not present for Gaussian models. A researcher must know a priori exactly which linear combinations of yields are Gaussian and which are non-Gaussian. This is difficult to determine in practice because it is not a priori clear which factor(s) (e.g. level, slope, or curvature) are the stochastic volatility factors.5
4 Data and identification
4.1 Data
We use the Fama and Bliss(1987) zero coupon bond data available from the Center for
Research in Securities Prices (CRSP). The data is monthly and spans from June 1952 through
June 2012 for a total of T = 721 observations with maturities of (1, 3, 12, 24, 36, 48, 60)
5 For non-Gaussian ATSMs, Collin-Dufresne, Goldstein, and Jones(2008) proposed a novel approach to estimate non-Gaussian ATSMs based on observable factors that are implied by the theoretical properties of continuous-time models. They define the state variables in terms of the slope, curvature, and integrated covariance of the instantaneous short rate. These quantities would be observable if a continuous record of yield data were available. Finding empirical counterparts for these theoretical quantities is non-trivial, especially over long periods of time.
months. For three factor models, the yields measured without error Y(1)t include the (1,
12, 60) month maturities. In models with four factors, Y(1)t are the (1, 12, 24, 60) month
maturities.
4.2 Identification
Proposition 1 gives guidelines for identification of the model. For the pure Gaussian part, a number of parameters enter the log-likelihood in the same way. This requires: (i) G restrictions on µg and µQg to prevent shifts; (ii) (H + 1)G(G − 1)/2 restrictions to identify Σi,g from Σi,g Σ′i,g; (iii) G restrictions between Σi,g and δ1g to prevent scaling; and (iv) G(G − 1) restrictions between Φg, ΦQg, and Σi,g to prevent rotation.6 For the pure non-Gaussian part, this requires: (i) H restrictions on µh to prevent shifts; (ii) H restrictions on Σh and δ1h to prevent scaling; and (iii) H(H − 1) restrictions on Σh, Φh, and ΦQh to prevent rotation. An additional GH restrictions are required on the matrices ΦQgh, Φgh, and Σgh to prevent rotation between the factors.
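These counts can be tallied mechanically. The function below merely transcribes the restriction counts listed above (the function name is ours, and the tally assumes no special cases such as repeated eigenvalues):

```python
def n_identifying_restrictions(G, H):
    """Total identifying restrictions implied by the counts listed above."""
    gaussian = (G                                 # (i) shifts: mu_g and mu_g^Q
                + (H + 1) * G * (G - 1) // 2      # (ii) Sigma_{i,g} from Sigma Sigma'
                + G                               # (iii) scaling: Sigma_{i,g} vs delta_1g
                + G * (G - 1))                    # (iv) rotation: Phi_g, Phi_g^Q, Sigma_{i,g}
    non_gaussian = H + H + H * (H - 1)            # shift, scaling, rotation
    cross = G * H                                 # rotation between the g and h factors
    return gaussian + non_gaussian + cross

print(n_identifying_restrictions(2, 1))  # A1(3)-style: two Gaussian factors, one volatility factor
```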
An identification exercise similar to Hamilton and Wu(2012b) indicates that only δ0 and
the three eigenvalues in ΦQ are identified from the cross-section. This is because in a just-
identified model the univariate cross-sectional regression of Y(2)t on xt can only identify four
parameters that enter the bond loadings. In both Gaussian and non-Gaussian models, these
parameters are enough to determine the bond loadings B for specific rotations of the state
vector and when Jensen’s inequality terms are small. For Gaussian models, these parameters
also determine A and hence the bond pricing function. For non-Gaussian models, there are more parameters under Q that determine A, and these are identified from the time-series component of the likelihood.
In our empirical work, we impose the following identifying restrictions. For the Gaussian
part, these are: (i) µQg = 0; (ii) ΦQg in ordered Jordan form;7 (iii) δ1g = ι is a column vector
6 For special cases such as repeated eigenvalues in Jordan form, there are additional restrictions, which we discuss in Section 6.3.
7 For the case where ΦQg has real distinct eigenvalues, it is a diagonal matrix with the diagonal elements in descending order.
of ones; and (iv) Σi,g is lower triangular. For the non-Gaussian part with H = 1, (i) µh = 0;
(ii) δ1h = ±1. For the cross terms, ΦQgh = 0. Elements of the vector δ1h can take either sign,
which unlike Gaussian-only models will lead to inequivalent maxima as we explain in Section
5.2. It is also possible to estimate the parameters νQh , νh as they are identified regardless of
the rotation of the factors, based on Proposition 1.
5 A three factor model
In this section, we use our new method to estimate a three factor model with one volatility
factor, which has been the preferred model of many researchers.8 This is the A1(3) model in the Dai and Singleton(2000) notation. For an A1(3) model, the concentrated likelihood reduces the number of parameters to be optimized numerically by one-third, from 24 to 16.
We focus on two aspects of estimation: (1) we compare the performance of our estimation
method with the conventional approach of directly maximizing the log-likelihood; (2) we
discuss why local maxima exist in models with both Gaussian and non-Gaussian factors.
5.1 Performance comparison
To illustrate the mileage we gain from using our method, we compare our approach to the
conventional method that does not concentrate out (µg,Φg,Φgh) or use analytical gradients.
We perform a Monte Carlo experiment where we estimate the A1(3) model on the CRSP dataset 100 times from 100 different starting values using both methods.9 We compare our method and the direct approach along two dimensions: convergence and speed. To
measure the former, we use the likelihood ratio (two times the difference in log-likelihoods).
8 This model has been widely considered the benchmark non-Gaussian ATSM; see Dai and Singleton(2000), Cheridito, Filipovic, and Kimmel(2007), Collin-Dufresne, Goldstein, and Jones(2008), and Aït-Sahalia and Kimmel(2010) for examples.
9 To make the comparison as parallel as we can, we write the likelihood function the same way, impose the same identifying restrictions, and use the same scaling and initial values for the parameters, except that the conventional method has additional parameters entering the numerical optimizer.
The global solution found by our method has a log-likelihood of 36647.69 (estimates and
asymptotic robust standard errors can be found on the right-hand side of Table 2). We achieve an identical value for all 17 random starting values that were initialized in this region or mode of the parameter space.10 Seventeen is the number of times (roughly one-sixth) that the optimizer started in this region. Conversely, the conventional method does not
find this log-likelihood once nor does the method reproduce the same (incorrect) estimates
for each of these 17 starting values. The highest log-likelihood value found by the standard
approach is 36645.29, and it is only achieved for one starting value. The difference between
the two methods corresponds to a likelihood ratio of 4.8. The null hypothesis that the two
likelihood values are statistically the same will be rejected by a χ2 test, even if our method
has 1 more degree of freedom than the conventional method. In short, the conventional
method does not achieve the global solution. Second, across these 17 starting values, the conventional method yields log-likelihood values ranging from 36636.82 to 36645.29; the difference between these two numbers is again statistically significant. With our method
producing the same number repeatedly, we can conclude that it is a maximum. The fact
that a conventional approach does not repeatedly find the same value even when it is initialized in the same region makes it extremely difficult to understand the behavior of the log-likelihood surface and, consequently, the economic implications of the model.
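The multi-start protocol above can be sketched on a toy surface. The example below is entirely hypothetical (a one-dimensional bimodal "log-likelihood" and a plain gradient-ascent local optimizer, not the paper's estimator); it shows the bookkeeping of launching many starts, grouping runs by the mode they reach, and picking the best mode:

```python
import numpy as np

rng = np.random.default_rng(1)

def loglik(x):
    # Toy bimodal "log-likelihood" standing in for an ATSM surface (hypothetical).
    return np.log(0.7 * np.exp(-0.5 * (x + 2) ** 2)
                  + 0.3 * np.exp(-0.5 * ((x - 2) / 0.5) ** 2))

def climb(x, lr=0.05, steps=2000, h=1e-5):
    """Plain gradient ascent with a numerical derivative (a stand-in optimizer)."""
    for _ in range(steps):
        g = (loglik(x + h) - loglik(x - h)) / (2 * h)
        x = x + lr * g
    return x

starts = rng.uniform(-6, 6, size=40)           # random starting values
ends = np.array([climb(x0) for x0 in starts])  # terminal point of each run
modes = np.unique(np.round(ends, 2))           # cluster runs by the mode they reach
best = modes[np.argmax([loglik(m) for m in modes])]
print("modes found:", modes, "| global mode:", best)
```

A reliable local optimizer converges to the same terminal point from every start in a basin, so the clustering step is trivial; an unreliable one scatters terminal points within a basin, which is exactly the pathology described above.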
An immediate benefit of the stable behavior of our method is that we are able to find
that the A1(3) model has 6 local modes with three well-behaved local maxima and three
regions of the log-likelihood that appear to be locally unidentified. The three well-behaved
local maxima are listed in Table 1 and we will discuss the properties of the model that create
these local modes in Section 5.2. Our method converges 17/100 times to Local 1, 14/100
times to Local 2, and 17/100 times to Local 3. Inspection of the starting values indicates that if our procedure is started in the region of a well-behaved local maximum, it converges to that maximum. This is not true when the log-likelihood is maximized directly using
10 We consider two log-likelihoods to be numerically identical if they agree to 2 decimal places. In practice, the log-likelihood values are identical to 8 decimal places.
the un-concentrated log-likelihood and no analytical gradients. The median likelihood ratio
between our procedure and the un-concentrated log-likelihood with no analytical gradient
is 29.5, indicating a substantial difference between the two procedures. The conventional
method, even if it gets close to a local maximum, always stops before it fully converges. This
makes it difficult for researchers to differentiate between points that are near a well-behaved
local maximum that have the same economic meaning and locations corresponding to local
maxima that are economically different. The fact that our method always finds the local maximum within the region helps us uncover the different local maxima and allows us to study their economic implications.
Estimation time is another important dimension along which we compare our approach
to the conventional method. The median estimation time for our procedure to estimate
from a random starting value is less than 2 minutes, whereas the conventional approach of
directly maximizing the log-likelihood function takes more than 2 hours. To perform our
Monte Carlo study with 100 starting values, it takes our method about 4 hours, whereas it
takes roughly 9 days to complete the same exercise with the conventional method.
In summary, our method addresses all of the following problems with the conventional
method. The conventional method is painfully slow. It does not achieve the global maximum.
And, it is extremely hard to assess convergence behavior and the number of local maxima
because conventional approaches do not repeatedly find the same local maximum even when
started in that region of the parameter space.
5.2 Local maxima
Using our approach simplifies estimation and helps uncover some features of the log-likelihood
surface that may be obscured by directly maximizing the log-likelihood. In this section,
we discuss the characteristics of the model that create local maxima and their economic
consequences.
In Gaussian ATSMs, a change in the sign of δ1g rotates the factors from gt to −gt. The rotation of δ1g is economically irrelevant because the estimated model switches between two equivalent global maxima. As a result, researchers need to fix the sign of δ1g to achieve identification.

Table 1: Local maxima in the A1(3) model

        Local 1    Local 2    Local 3
ht      level      slope      curvature
ΦQh     0.9961     0.9512     0.5412
ΦQg     0.9514     0.9992     0.9965
        0.5358     0.5672     0.9507
δ1,h    1          1          -1
LLF     36647.69   36442.15   36477.72

Maximum likelihood estimates of ΦQh and ΦQg with the corresponding log-likelihood values.
Unlike Gaussian models, fixing the sign of δ1h is not inconsequential. The state variable ht is
positive by definition. Changing the sign of δ1h does not rotate ht to −ht. Therefore, there
can exist inequivalent local maxima for each combination of different signs of δ1h. For each
of the local maxima, the estimated latent factor ht is different, which changes the conditional
variance of gt and consequently the log-likelihood.
Reordering the eigenvalues in ΦQ has completely different implications for non-Gaussian
models than for Gaussian models.11 If the eigenvalues are reordered in a multi-factor Gaus-
sian model, it implies equivalent global maxima with the same economic implication. How-
ever, with non-Gaussian factors, they can yield inequivalent local maxima. Here, we demon-
strate the intuition using the A1(3) model, although the basic idea holds for all non-Gaussian
models. The factors are labeled as level, slope and curvature, from most persistent to least
persistent. Reordering the eigenvalues across ΦQg and ΦQh does not generally change the
shape of the factors but it does change whether ht is the level, slope, or curvature. Any
change in ht from one type of factor (level) to another (slope/curvature) implies a different
conditional variance for gt making the likelihood no longer equivalent. More importantly,
11 We collect the autoregressive parameters together in block lower-triangular matrices:

Φ = ( Φh    0
      Φgh   Φg ),        ΦQ = ( ΦQh    0
                                ΦQgh   ΦQg )
the economic implications that can be drawn from the model such as evidence about the
expectations hypothesis, term premia, estimates of conditional volatilities, and forecasts will
change. Changing the order of the eigenvalues within ΦQg and/or ΦQh merely reorders the factors within each respective state vector, which results in an equivalent global maximum. The intuition is the same as reordering the factors gt within a Gaussian ATSM.
In a non-Gaussian ATSM, it is not clear a priori which local maximum created by these
characteristics of the model will be the global maximum. To estimate a non-Gaussian model,
one must intentionally search each region that potentially contains a local maximum and compare the likelihood values. To illustrate this idea, we present different local maxima for the A1(3) model corresponding to different signs of δ1,h and different orderings of the eigenvalues. We
report ΦQ, δ1,h, and log-likelihood values in Table 1. In the first column, ht is the level factor
and δ1,h is positive. This is the global maximum in this case. In our sample, volatility is high
during episodes where interest rates are high, so the level factor tends to explain the volatility
best and δ1,h is positive. The next two columns present what happens when ht is the slope or
curvature factor. Due to the nature of the data we are using, the likelihood function drops
significantly from the global maximum to these alternative local maxima. In theory, there
are six potentially different local maxima for each combination of eigenvalues and sign of
δ1,h but in practice there are only three well-behaved local maxima with the remaining local
maxima being locally unidentified. In summary, when estimating non-Gaussian models with
both Gaussian and non-Gaussian factors, we recommend trying to intentionally find each of
the local maxima and compare their log-likelihood values.
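The recommended search can be organized by enumerating the candidate regions explicitly: one per assignment of an eigenvalue to the volatility factor, times the two signs of δ1,h. A minimal sketch (the eigenvalues are round numbers loosely in the spirit of Table 1, and the data layout is ours):

```python
from itertools import product

# Eigenvalues for level, slope, curvature (round numbers in the spirit of Table 1).
eigs = {"level": 0.996, "slope": 0.951, "curvature": 0.536}

candidates = []
for vol_factor, sign in product(eigs, [1, -1]):
    candidates.append({
        "PhiQ_h": eigs[vol_factor],                                # eigenvalue given to h_t
        "PhiQ_g": [v for k, v in eigs.items() if k != vol_factor], # remaining Gaussian ones
        "delta1_h": sign,                                          # sign of delta_1h
        "label": f"h = {vol_factor}, delta1_h = {sign:+d}",
    })

for c in candidates:
    print(c["label"])
print(len(candidates), "candidate regions")
```

Each candidate would then serve as the starting configuration for one run of the estimator, with the resulting log-likelihood values compared at the end.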
6 Model comparison
In this section, we use our methodology to estimate several more ATSMs. We restrict
attention to models with at least three factors. We impose the Feller condition νh,i > 1 and
νQh,i > 1 for i = 1, . . . , H and do not impose restrictions on the covariance matrix Ω.

Table 2: Maximum likelihood estimates for the A0(3) and A1(3) models. Left panel: Gaussian A0(3) model (G = 3, H = 0), LLF = 37080.94. Right panel: non-Gaussian A1(3) model (G = 2, H = 1), LLF = 36647.69. Maximum likelihood estimates with asymptotic robust standard errors. The restrictions µQg = 0, δ1,g = ι, and δ1,h = 1 are imposed during estimation. [Table entries omitted.]
6.1 Cross section
6.1.1 Three factor models
Besides the A1(3) model demonstrated in Section 5, we estimate the three factor Gaussian
model that is popular in the macro-finance literature. As with the A1(3) model, the number of parameters that need to be optimized numerically in the A0(3) model drops dramatically when using the concentrated likelihood instead of the original likelihood. It drops by more
than half, from 22 to 10.12

12 The same improvement is also achieved in Joslin, Singleton, and Zhu(2011).

We report parameter estimates and asymptotic robust standard errors (see Hamilton(1994), equation 5.8.7) for both models in Table 2. An interesting feature of the results is how the
estimated values of ΦQ are practically identical across both models and consequently so are
the bond loadings. This implies that the estimated latent factors (level, slope and curvature)
are also identical; see Figure 2 for the A0(3) and A1(3) models. The correlation between
each of the respective factors is 0.999. The only noticeable difference is the level of the level
factor, which is forced to be positive in the A1(3) model because it is the non-Gaussian state
variable.
As both the factors and bond loadings are identical, the cross-sectional component p(Y(2)t | xt; θ) of the likelihood (9) is the same for the A0(3) and A1(3) models even with
Jensen’s inequality taken into account. This implies that when economic conclusions (e.g.
market prices of risk and term premia) differ across Gaussian and non-Gaussian three factor
models these differences are driven primarily by each model’s respective time series proper-
ties. In Section 6.2, we discuss the differences in term premia and conditional volatility in
more detail.
The fact that the A0(3) and A1(3) models fit the cross-section of yields equally well can be explained by two things. First, the bond loading recursions for Gaussian and non-Gaussian models in (6) and (7) are the same up to Jensen's inequality. The Jensen's inequality terms
are small empirically. Secondly, the magnitude of the cross-sectional measurement errors Ω is
much smaller than those of the dynamics Σi,gΣ′i,g. These matrices are key components of the
information matrix, which determines how much emphasis MLE gives to each component.
An efficient estimator, such as maximum likelihood, prioritizes the greater information (large
Ω−1) in the cross-section and chooses the parameters that enter the bond loadings B, i.e.
ΦQ, to match that feature of the data. As a result, the estimated values of ΦQ in both models are practically identical; see Table 2. The estimates of ΦQ from Gaussian models
therefore provide excellent starting values when estimating non-Gaussian models. However,
as discussed in Section 5.2, the likelihood still has multiple modes depending on which factors (level, slope, or curvature) are Gaussian and which are non-Gaussian. As the cross-sectional component p(Y(2)t | xt; θ) of the likelihood (9) is the same across different local maxima (this can easily be seen from Table 1), which factors act as volatility is pinned down by the time-series portion of the likelihood.

Figure 2: Estimated latent factors for the A0(3) and A1(3) models. Top row: Gaussian A0(3) model with, from left to right, the first g1t, second g2t, and third g3t factors. Bottom row: non-Gaussian A1(3) model with, from left to right, the first ht, second g1t, and third g2t factors. [Figure panels omitted.]
6.1.2 Four factor models
Next, we consider two four factor models: the Gaussian A0(4) model and the non-Gaussian
A1(4) model. There are a total of 35 parameters in the A0(4) model and only 20 of these
parameters enter the numerical optimizer, while there are 39 parameters in the A1(4) model
and 15 of these can be concentrated out. When estimating the A1(4) model, we found that
ΦQg had a pair of repeated eigenvalues, requiring the use of the Jordan decomposition for
ΦQg.13 We will explain this in more detail in Section 6.3 below.

Table 3: Maximum likelihood estimates for the A0(4) and A1(4) models. Left panel: Gaussian A0(4) model (G = 4, H = 0), LLF = 37195.84. Right panel: non-Gaussian A1(4) model (G = 3, H = 1), LLF = 36729.22. Maximum likelihood estimates with asymptotic standard errors. The restrictions µQg = 0, δ1,g = ι, and δ1,h = 1 are imposed during estimation. [Table entries omitted.]
13 With two repeated eigenvalues in the Gaussian factors, the Jordan decomposition of ΦQg in the A1(4) model becomes

ΦQg = ( λ1   1    0
        0    λ1   0
        0    0    λ2 ),

where λ1 and λ2 are the unique eigenvalues.
Parameter estimates and asymptotic robust standard errors for both models are in Table
3. The A1(4) model has three fewer parameters in the conditional mean and six more
parameters in the conditional variance. Output from these two models is comparable just as
it was for the two three factor models. The largest and smallest values in ΦQ are close across
both models. The two middle elements in the matrix ΦQ of the A1(4) model have been
imposed to be the same (repeated real eigenvalues). The estimated value of this parameter
is roughly the average of the two middle estimates in the matrix ΦQ of the A0(4) model.
The overall magnitude of the cross-sectional likelihood p(Y(2)t | xt; θ) is still basically the same, indicating that the primary differences across the models are found in the time-series
component of the likelihood. When repeated real eigenvalues are imposed on the A0(4)
model, the estimated log-likelihood is 37194.35. The Gaussian and non-Gaussian models
then have equivalent factors.
6.1.3 Information in the cross section
The previous sections point to the fact that MLE prioritizes the cross section, and the elements of ΦQ for both models are estimated with high precision. The primary source of this precision is the high-powered polynomial functions of ΦQ in the bond loadings B in (6) and (7). These polynomial functions are sensitive to small changes in the parameters, especially as the eigenvalues of ΦQ get closer to one. This is why ΦQ is estimated precisely, especially for the
level factor. Given this argument, a natural question is whether it is necessary to have the
whole yield curve or just a handful of yields in order to get a precise description of the cross
section.
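A quick back-of-the-envelope calculation (our own numerical illustration, not from the paper) shows why: the loading term λ^n for a 60-month bond moves substantially for a small change in an eigenvalue λ near one.

```python
n = 60                                # a 5-year (60-month) bond
lam_hi, lam_lo = 0.995, 0.990         # two nearby eigenvalues
load_hi, load_lo = lam_hi ** n, lam_lo ** n
rel_change = load_hi / load_lo - 1.0  # ~35% move in the loading term
print(f"lam^{n}: {load_hi:.4f} vs {load_lo:.4f}; relative change {rel_change:.1%}")
```

A half-percent change in the eigenvalue thus moves the 60-period loading term by roughly a third, which is why the likelihood pins down the most persistent eigenvalue so tightly.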
From the discussion on identification for Gaussian ATSMs in Hamilton and Wu(2012b),
we know that only four yields (one in the cross section Y(2)t ) are required to estimate Gaussian
models with three factors. A similar analysis of identification for AH(3) models indicates
that the same is true for any number of volatility factors. Yields from the cross-section beyond the first do not identify more Q parameters; additional yields only provide over-identifying information. Do they increase the precision of the parameter
estimates? If so, by how much?
We run an experiment where we estimate the model using all possible subsets of yields in Y(2)t. Our goal is to study the incremental information in these
yields, and we use the size of the standard errors as our proxy. We use the A0(3) model for demonstration. With only one yield included in Y(2)t, the standard errors for the largest eigenvalue
in ΦQg range from 0.00078 to around 0.0028, as opposed to 0.00071 when the model is esti-
mated with all the yields included. Even the largest standard error is still small compared
to the point estimate of 0.995, and it is an order of magnitude smaller than the uncertainty
of the P counterpart. The smallest standard error is obtained when Y(2)t includes only the
3 year yield and the largest when it includes the 3-month yield. To pin down the cross section with greater precision, it helps to spread out the yields between Y(1)t and Y(2)t. Once
we include two yields in Y(2)t , the standard errors are smaller and exhibit less variability. The
smallest standard error for the largest eigenvalue is 0.00071 if we include both the 3 month
and 3 year yields, which is the same as in Table 2. The overall message is that a handful of yields in the cross section contains a large amount of information because of the high-powered polynomial functions in the bond loadings. This contrasts with the popular view that the superior information comes from the large number of cross-sectional yields at any given time period.
6.2 Time series
The class of ATSMs defines a set of non-nested models, making direct comparison of the models based on likelihood ratio statistics infeasible. However, by any information criterion, Gaussian models are preferred. This is because Gaussian models such as the A0(3)
and A0(4) have higher log-likelihood values and fewer parameters than their respective A1(3)
and A1(4) counterparts.
In the remainder of this section, we focus on the two respects in which Gaussian and non-Gaussian models differ in the time series, namely their conditional means
and variances under P. Non-Gaussian ATSMs add parameters to the model to make the
conditional variance time-varying but impose restrictions on the conditional mean. The
restriction that non-Gaussian models impose comes from the fact that the conditional mean
of ht+1|It does not depend on gt by construction, i.e., the matrices Φ and ΦQ both contain
blocks of zeros. The economic implication of this restriction (using the terminology of three factor models) is that the level factor does not depend on the past values of the slope and curvature factors. This is counterintuitive. If the slope is high in the last period, i.e., the
long rate is much higher than the short rate (more than explained by compensating for risk),
then the market expects the short rate to increase in the future. On average, the next period’s short rate or level will increase. A similar point has been made previously by
Duffee(2002).
Next, we compare the impact of the restrictions on the conditional mean using the term
premium, which is a popular and important measure for monetary policy. The term premium
measures the additional compensation a risk averse agent needs to hold the risky asset relative
to a risk-neutral agent. The term premium is defined as the difference between the model
implied yield ynt and the average of expected future short rates over the same period
rpnt ≡ ynt − ynt , ynt =1
nEt (rt + rt+1 + . . .+ rt+n−1)
see, e.g., Cochrane and Piazzesi(2008).14 Defining risk compensation in terms of the term premium has the nice feature that it is invariant to the rotation of the state vector, unlike the market prices of risk. The restrictions on the conditional mean have a large impact on the flexibility of the term premia for non-Gaussian models. Figure 3 plots the one year and five year term premia for all four models. At short horizons, the difference in the term premia between the A0(·) and A1(·) models is small, but it grows larger as the horizon increases. Over longer horizons, the fact that the most persistent of the factors in the A0(·) model reverts to its unconditional mean faster implies that there is more variation in the term premium from one period to the next. In particular, the A0(3) and A1(3) models have substantially different implications for the 5 year term premium.

14 The solution to this expectation is

ȳ^n_t = (1/n)( nδ0 + δ′1[(n−1)I + (n−2)Φ + · · · + Φ^{n−2}]µ + δ′1[I + Φ + Φ² + · · · + Φ^{n−1}]x_t ).

Figure 3: One year and five year term premia. Estimated term premia at the one and five year horizons. Top left: 1 year term premia from the A0(3) and A1(3) models. Top right: 5 year term premia from the A0(3) and A1(3) models. Bottom left: 1 year term premia from the A0(4) and A1(4) models. Bottom right: 5 year term premia from the A0(4) and A1(4) models. [Figure panels omitted.]
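The expression in footnote 14 can be evaluated with two running sums of powers of Φ. A minimal sketch (the function name and the toy one-factor parameter values are ours):

```python
import numpy as np

def avg_expected_short_rate(n, delta0, delta1, mu, Phi, x_t):
    """ybar_t^n = (1/n)( n*delta0
                         + delta1' [ (n-1)I + (n-2)Phi + ... + Phi^{n-2} ] mu
                         + delta1' [ I + Phi + ... + Phi^{n-1} ] x_t )."""
    K = len(x_t)
    S_mu = np.zeros((K, K))  # sum_{j=0}^{n-2} (n-1-j) Phi^j
    S_x = np.zeros((K, K))   # sum_{j=0}^{n-1} Phi^j
    Pj = np.eye(K)           # running power Phi^j
    for j in range(n):
        S_x += Pj
        if j <= n - 2:
            S_mu += (n - 1 - j) * Pj
        Pj = Pj @ Phi
    return (n * delta0 + delta1 @ S_mu @ mu + delta1 @ S_x @ x_t) / n

# Toy one-factor example: r_t = delta0 + delta1 * x_t with an AR(1) factor.
delta0, delta1 = 0.01, np.array([1.0])
mu, Phi, x_t = np.array([0.0]), np.array([[0.9]]), np.array([0.02])
ybar = avg_expected_short_rate(2, delta0, delta1, mu, Phi, x_t)
print(ybar)  # analytically (2*0.01 + (1 + 0.9)*0.02) / 2 = 0.029
```

The term premium rp^n_t is then the model-implied yield y^n_t minus this average.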
The estimated conditional volatilities of yields from the A0(3) and A1(3) models are in
Figure 4, while the estimated conditional volatilities of yields from the four factor models
have the same qualitative features and are not shown. To provide a point of comparison, we also plot in these graphs estimates of the conditional volatilities from the multivariate generalized autoregressive score model of Creal, Koopman, and Lucas(2011) and Creal, Koopman, and Lucas(2013).15

Figure 4: Estimated conditional volatility of yields in the A1(3) model. Estimated conditional volatility of the yields measured without error Y(1)t for the A0(3) and A1(3) models compared with the multivariate generalized autoregressive score model. Top left: volatility of the one month yield. Top right: volatility of the one year yield. Bottom left: variance ht compared to the 1 month and 5 year yields. Bottom right: volatility of the five year yield. [Figure panels omitted.]

We find that A1(·) models are able to capture the broad trend
of volatilities, whereas A0(·) models have constant volatility by construction. However, the
estimated volatilities from A1(·) models are much less volatile than GAS volatility. A sim-
ilar observation for the A1(3) and A1(4) model was made by Collin-Dufresne, Goldstein,
and Jones(2009) using univariate GARCH models. The intuition is that in A1(·) models
15The generalized autoregressive score model with time-varying covariance matrix is similar to a multi-variate GARCH model. To make the volatilities of yields comparable across models, we use a VAR(1) for
the conditional mean of yields Y(1)t and allow the errors to have time-varying volatilities and correlations.
34
Table 4: Repeated EigenvaluesLocal 1 Local 2 Local 3 Repeated
ΦQh 0.9952 0.9952 0.9952 0.9952ΦQg 0.9130 0.9164 0.9126 0.9121
0.9112 0.9075 0.9115 –0.7021 0.7025 0.7021 0.7021
LLF 36729.22 36729.20 36729.22 36729.22
the non-Gaussian state variable ht serves a dual role: it is the level factor as well as the
volatility factor. The maximum likelihood estimator chooses the parameter vector θ to fit
the conditional mean first before fitting the conditional variance.
In the bottom left panel of Figure 4, we also compare the variance factor \(h_t\) with the short rate (the 1 month yield) and the long rate (the 5 year yield). Contrary to the conventional wisdom that the volatility of interest rates is driven by the short rate, we find that the variance factor mimics the movement of the long-term interest rate more closely, with a correlation between \(h_t\) and the 5 year yield of 97.6%.
6.3 Repeated eigenvalues
For identification of the parameters of the model, a necessary condition is that \(B_1\) is invertible. When \(\Phi^Q\) has repeated eigenvalues, the matrix \(B_1\) will be singular and one element of \(\Phi^Q\) is unidentified. For a matrix with repeated eigenvalues, the Jordan decomposition imposes the restrictions necessary to obtain identification.
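The link between repeated eigenvalues and a singular \(B_1\) can be illustrated with the Gaussian loading recursion \(b_n' = b_{n-1}'\Phi^Q - \delta_1'\) as a stand-in sketch (the parameter values and the two-maturity setup are hypothetical, not taken from the paper): with a repeated eigenvalue, the columns of the stacked loading matrix become proportional, so it cannot be inverted.

```python
import numpy as np

def loadings(Phi_Q, delta1, N):
    """Stack b_n' for n = 1..N using b_n' = b_{n-1}' Phi_Q - delta1', b_1' = -delta1'."""
    b = -delta1.copy()
    rows = [b.copy()]
    for _ in range(N - 1):
        b = b @ Phi_Q - delta1
        rows.append(b.copy())
    return np.vstack(rows)

delta1 = np.array([1.0, 2.0])
B_distinct = loadings(np.diag([0.95, 0.90]), delta1, 2)  # distinct eigenvalues
B_repeated = loadings(np.diag([0.95, 0.95]), delta1, 2)  # repeated eigenvalue

# with distinct eigenvalues the 2x2 loading matrix is invertible; with a
# repeated eigenvalue each column equals -delta1_i * (1 + lambda), so the
# columns are proportional and the matrix is singular
```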
An example of repeated real eigenvalues occurs when we estimate the A1(4) model on the Fama-Bliss data set. When the A1(4) model is estimated without imposing repeated eigenvalues, the model produces identical likelihoods for different sets of parameter vectors θ. We report some of these optima in Table 4. Across the four local maxima, the values of \(\Phi_h^Q\), the last eigenvalue of \(\Phi_g^Q\), and the log-likelihood function are almost identical, but the first two eigenvalues of \(\Phi_g^Q\) vary across the different optima. This empirical finding indicates the existence of repeated eigenvalues: not all the parameters are identified using the diagonal form of \(\Phi^Q\). The last column of Table 4 shows the results when we impose repeated eigenvalues. The estimates of \(\Phi_h^Q\), the last eigenvalue of \(\Phi_g^Q\), and the likelihood function have the same values as before. However, the first two eigenvalues of \(\Phi_g^Q\) are identical by definition and are equal to the average of the first two eigenvalues in those local maxima. The log-likelihood value also does not change.
7 Conclusion
We generalize the class of discrete-time non-Gaussian ATSMs by allowing for any admissible rotation of the non-Gaussian state variables, and we provide a new approach to estimate them. The new estimation approach leverages the fact that many of the parameters (i.e., the P parameters) can be concentrated out of the likelihood function. Our method improves estimation dramatically by reducing the number of parameters that need to be maximized numerically; at the same time, the parameters that are concentrated out are precisely those that can cause numerical problems due to the near unit-root nature of interest rates. We illustrate that our method speeds up estimation by more than 60 times and finds the maxima consistently. We also explain why there exist non-equivalent local maxima and describe their different economic implications. Using our new method, we find that Gaussian and non-Gaussian models fit the cross-section of yields equally well; differences in the economic implications of these models come from their relative fit of the time series. Finally, we explain where the superior cross-sectional information comes from, and demonstrate that it is not necessarily due to observing a large number of cross-sectional yields.
Our method can be used for any rotation of the state variables implied by the identifying restrictions. The methods can be applied to ATSMs that include observable macroeconomic variables or hidden factors, and allow for restrictions on the key parameters of interest. The fact that our method can be implemented successfully without relying on any specific rotation is critical for non-Gaussian models. Unlike Gaussian models, where all the factors are symmetric, rotations of the factors in non-Gaussian models do not result in equivalent local maxima.
References
Abramowitz, Milton, and Irene A. Stegun (1964) Handbook of Mathematical Functions.
Dover Publications, New York, NY.
Aït-Sahalia, Yacine, and Robert L. Kimmel (2010) “Estimating affine multifactor term
structure models using closed-form likelihood expansions” Journal of Financial Eco-
nomics 98, 113–144.
Ang, Andrew, and Monika Piazzesi (2003) “A no-arbitrage vector autoregression of term
structure dynamics with macroeconomic and latent variables” Journal of Monetary
Economics 50, 745–787.
Bauer, Michael D. (2011) “Bayesian Estimation of Dynamic Term Structure Models
under Restrictions on Risk Pricing” Federal Reserve Bank of San Francisco Working
Paper 2011-03.
Bauer, Michael D., Glenn D. Rudebusch, and Jing Cynthia Wu (2012) “Correcting esti-
mation bias in dynamic term structure models.” Journal of Business and Economic
Statistics 30, 454–467.
Cheridito, Patrick, Damir Filipovic, and Robert L. Kimmel (2007) “Market price of
risk specifications for affine models: theory and evidence” Journal of Financial Eco-
nomics 84, 123–170.
Christensen, Jens H.E., Francis X. Diebold, and Glenn D. Rudebusch (2011) “The affine
arbitrage-free class of Nelson-Siegel term structure models.” Journal of Econometrics
164, 4–20.
Cochrane, John H., and Monika Piazzesi (2008) “Decomposing the yield curve” Unpub-
lished manuscript, Booth School of Business, University of Chicago.
Collin-Dufresne, Pierre, Robert S. Goldstein, and Charles Jones (2008) “Identification
of maximal affine term structure models.” The Journal of Finance 63, 743–795.
Collin-Dufresne, Pierre, Robert S. Goldstein, and Charles Jones (2009) “Can the volatil-
ity of interest rates be extracted from the cross section of bond yields? An investiga-
tion of unspanned stochastic volatility.” Journal of Financial Economics 94, 47–66.
Cox, John C., Jonathan E. Ingersoll, and Stephen A. Ross (1985) “A theory of the term
structure of interest rates” Econometrica 53, 385–407.
Creal, Drew D., Siem Jan Koopman, and Andre Lucas (2011) “A dynamic multivariate
heavy-tailed model for time-varying volatilities and correlations.” Journal of Busi-
ness and Economic Statistics 29, 552–563.
Creal, Drew D., Siem Jan Koopman, and Andre Lucas (2013) “Generalized autoregres-
sive score models with applications” Journal of Applied Econometrics 28, 777–795.
Dai, Qiang, and Kenneth J. Singleton (2000) “Specification analysis of affine term struc-
ture models.” The Journal of Finance 55, 1943–1978.
de Jong, Piet (1991) “The diffuse Kalman filter” The Annals of Statistics 19, 1073–83.
Diebold, Francis X., and Glenn D. Rudebusch (2013) Yield Curve Modeling and Fore-
casting. Princeton University Press, Princeton, NJ.
Duffee, Gregory R. (2002) “Term premia and interest rate forecasts in affine models”
The Journal of Finance 57, 405–443.
Duffee, Gregory R. (2011) “Information in (and not in) the term structure” The Review
of Financial Studies 24, 2895–2934.
Duffee, Gregory R. (2012) “Bond pricing and the macroeconomy” Working paper, Johns
Hopkins University.
Duffie, Darrell, and Rui Kan (1996) “A yield factor model of interest rates” Mathemat-
ical Finance 6, 379–406.
Durbin, James, and Siem Jan Koopman (2012) Time Series Analysis by State Space
Methods, 2nd edition. Oxford University Press, Oxford, UK.
Fama, Eugene F., and Robert R. Bliss (1987) “The information in long maturity forward
rates” American Economic Review 77, 680–692.
Gourieroux, Christian, and Joann Jasiak (2006) “Autoregressive gamma processes.”
Journal of Forecasting 25, 129–152.
Gourieroux, Christian, Joann Jasiak, and Razvan Sufana (2009) “The Wishart autore-
gressive process of multivariate stochastic volatility.” Journal of Econometrics 150,
167–181.
Gourieroux, Christian, and Alain Monfort (1995) Statistics and Econometric Models,
volume 1. Cambridge University Press, Cambridge, UK.
Gurkaynak, Refet S., and Jonathan H. Wright (2012) “Macroeconomics and the term
structure” Journal of Economic Literature 50, 331–367.
Hamilton, James D. (1994) Time Series Analysis. Princeton University Press, Princeton,
NJ.
Hamilton, James D., and Jing Cynthia Wu (2012a) “The effectiveness of alternative
monetary policy tools in a zero lower bound environment” Journal of Money, Credit,
and Banking 44 (s1), 3–46.
Hamilton, James D., and Jing Cynthia Wu (2012b) “Identification and estimation of
Gaussian affine term structure models.” Journal of Econometrics 168, 315–331.
Hamilton, James D., and Jing Cynthia Wu (2012c) “Testable implications of affine term
structure models.” Journal of Econometrics forthcoming.
Joslin, Scott, Kenneth J. Singleton, and Haoxiang Zhu (2011) “A new perspective on
Gaussian affine term structure models” The Review of Financial Studies 24, 926–
970.
Kim, Don H., and Athanasios Orphanides (2005) “Term structure estimation with sur-
vey data on interest rate forecasts.” Federal Reserve Board, Finance and Economics
Discussion Series 2005-48.
Kim, Don H., and Jonathan H. Wright (2005) “An arbitrage-free three-factor term
structure model and the recent behavior of long-term yields and distant-horizon
forward rates.” Federal Reserve Board, Finance and Economics Discussion Series
2005-33.
Le, Anh, Kenneth J. Singleton, and Qiang Dai (2010) “Discrete-time affine term struc-
ture models with generalized market prices of risk.” The Review of Financial Studies
23, 2184–2227.
Litterman, Robert, and Jose Scheinkman (1991) “Common factors affecting bond re-
turns” The Journal of Fixed Income 1, 54–61.
Piazzesi, Monika (2010) “Affine term structure models” in Handbook of Financial Econo-
metrics, edited by Y. Aït-Sahalia and L. P. Hansen, Elsevier, New York, pages 691–
766.
Wright, Jonathan H. (2011) “Term premia and inflation uncertainty: empirical evidence
from an international panel dataset” American Economic Review 101(4), 1514–1534.
Appendix A Distributions
We start by defining several of the distributions found in the paper, which are useful for implementing the
procedures in practice. The notation for these distributions is local to the appendix.
Appendix A.1 Gamma and multivariate gamma distributions
A univariate gamma r.v. \(w_{t+1} \sim \text{Gamma}(\nu_h, \kappa)\) has p.d.f.
\[
p(w_{t+1}|\nu_h,\kappa) = \frac{1}{\Gamma(\nu_h)}\, w_{t+1}^{\nu_h-1}\, \kappa^{-\nu_h} \exp\left(-\frac{w_{t+1}}{\kappa}\right)
\]
and Laplace transform \(E[\exp(u w_{t+1})] = \left(\frac{1}{1-\kappa u}\right)^{\nu_h}\), which exists only if \(\kappa u < 1\). The mean and variance are \(E(w_{t+1}) = \nu_h \kappa\) and \(V(w_{t+1}) = \nu_h \kappa^2\).
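The moments and the Laplace transform above are easy to check by Monte Carlo (a sketch; the parameter values are arbitrary, with \(\kappa u < 1\)):

```python
import numpy as np

rng = np.random.default_rng(0)
nu, kappa, u = 2.5, 0.8, 0.4          # arbitrary values with kappa*u < 1
w = rng.gamma(shape=nu, scale=kappa, size=2_000_000)

# E(w) = nu*kappa, V(w) = nu*kappa^2
print(w.mean(), nu * kappa)
print(w.var(), nu * kappa**2)
# Laplace transform E[exp(u w)] = (1 / (1 - kappa*u))^nu
print(np.exp(u * w).mean(), (1.0 / (1.0 - kappa * u)) ** nu)
```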
A multivariate gamma random vector \(h_{t+1} \sim \text{Mult. Gamma}(\nu_h, \Sigma_h, \mu_h)\) can be obtained by shifting and rotating a vector of uncorrelated gamma r.v.'s. It can be written as \(h_{t+1} = \mu_h + \Sigma_h w_{t+1}\), where \(w_{t+1}\) is an \(H \times 1\) vector with elements \(w_{i,t+1} \sim \text{Gamma}(\nu_{h,i}, 1)\) for \(i = 1, \ldots, H\). The \(H \times 1\) vector of (non-negative) location parameters is \(\mu_h\), \(\Sigma_h\) is a full-rank \(H \times H\) matrix of (non-negative) scale parameters, and \(\nu_h > 0\) is an \(H \times 1\) vector of shape parameters. The p.d.f. of \(h_{t+1}\) can be determined by a standard change of variables
\[
p(h_{t+1}|\nu_h,\Sigma_h,\mu_h) = |\Sigma_h^{-1}| \prod_{i=1}^H \frac{1}{\Gamma(\nu_{h,i})} \left(e_i'\Sigma_h^{-1}[h_{t+1}-\mu_h]\right)^{\nu_{h,i}-1} \exp\left(-e_i'\Sigma_h^{-1}[h_{t+1}-\mu_h]\right)
\]
where \(e_i\) is an \(H \times 1\) unit vector that selects the \(i\)-th element of a vector. The mean and variance are \(E[h_{t+1}] = \mu_h + \Sigma_h\nu_h\) and \(V[h_{t+1}] = \Sigma_h\,\text{diag}(\nu_h)\,\Sigma_h'\). The Laplace transform is
\[
\begin{aligned}
E[\exp(u'h_{t+1})] &= \int_0^\infty \exp(u'h_{t+1})\, p(h_{t+1}|\nu_h,\Sigma_h,\mu_h)\, dh_{t+1} \\
&= \exp(u'\mu_h) \int_0^\infty \exp(u'\Sigma_h w_{t+1}) \prod_{i=1}^H \frac{1}{\Gamma(\nu_{h,i})}\, w_{i,t+1}^{\nu_{h,i}-1} \exp(-w_{i,t+1})\, dw_{t+1} \\
&= \exp(u'\mu_h) \prod_{i=1}^H \left(\frac{1}{1-e_i'\Sigma_h'u}\right)^{\nu_{h,i}} = \exp\left(u'\mu_h - \sum_{i=1}^H \nu_{h,i} \log\left[1-e_i'\Sigma_h'u\right]\right)
\end{aligned}
\]
The Laplace transform exists only if \(e_i'\Sigma_h'u < 1\) for \(i = 1, \ldots, H\).
Appendix A.2 Multivariate non-central gamma distributions
An \(H \times 1\) non-central gamma (NCG) random vector \(h_{t+1} \sim \text{Mult. N.C.G.}(\nu_h, \Phi_h h_t, \Sigma_h, \mu_h)\) is a Poisson mixture of multivariate gamma r.v.'s
\[
\begin{aligned}
h_{t+1} &= \mu_h + \Sigma_h w_{t+1} \\
w_{i,t+1} &\sim \text{Gamma}(\nu_{h,i} + z_{i,t+1},\, 1) \qquad i = 1, \ldots, H \\
z_{i,t+1} &\sim \text{Poisson}\left(e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t\right) \qquad i = 1, \ldots, H.
\end{aligned}
\]
The process \(h_t\) remains positive and well-defined as long as \(\mu_h \geq 0\), \(\Sigma_h^{-1}\Phi_h\Sigma_h \geq 0\), and no element of \(\Sigma_h\) is negative. The conditional mean and variance are
\[
\begin{aligned}
E(h_{t+1}|h_t) &= (I_H - \Phi_h)\mu_h + \Sigma_h\nu_h + \Phi_h h_t \\
V(h_{t+1}|h_t) &= \Sigma_h\,\text{diag}\left(\nu_h - 2\Sigma_h^{-1}\Phi_h\mu_h\right)\Sigma_h' + \Sigma_h\,\text{diag}\left(2\Sigma_h^{-1}\Phi_h h_t\right)\Sigma_h'
\end{aligned}
\]
A standard multivariate NCG random variable (i.e., the discrete-time CIR process) is obtained by setting \(\mu_h = 0\) and letting \(\Sigma_h\) be a diagonal matrix. Further properties of the univariate NCG process are described in Gourieroux and Jasiak (2006).
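The Poisson-mixture construction simulates directly, which also gives a numerical check of the conditional mean formula above (a sketch with hypothetical parameter values and a diagonal \(\Sigma_h\) for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
H = 2
nu = np.array([1.5, 2.0])
mu = np.array([0.1, 0.2])
Phi = np.array([[0.9, 0.0], [0.05, 0.8]])
Sigma = np.diag([0.3, 0.5])
Sigma_inv = np.linalg.inv(Sigma)

h_t = np.array([1.0, 1.5])
w_t = Sigma_inv @ (h_t - mu)

n = 500_000
lam = Sigma_inv @ Phi @ Sigma @ w_t          # Poisson intensities e_i' Sigma^-1 Phi Sigma w_t
z = rng.poisson(lam, size=(n, H))            # mixing variables
w_next = rng.gamma(shape=nu + z)             # w_{i,t+1} ~ Gamma(nu_i + z_i, 1)
h_next = mu + w_next @ Sigma.T               # h_{t+1} = mu + Sigma w_{t+1}

mean_formula = (np.eye(H) - Phi) @ mu + Sigma @ nu + Phi @ h_t
print(h_next.mean(axis=0), mean_formula)
```

Because the gamma draws are non-negative and \(\Sigma_h\) has non-negative entries, every simulated \(h_{t+1}\) stays at or above \(\mu_h\), illustrating the admissibility conditions.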
As long as \(\Sigma_h\) has full rank, the p.d.f. can be found by integrating out the Poisson r.v.'s
\[
\begin{aligned}
p(h_{t+1}|\nu_h,\Phi_h h_t,\Sigma_h,\mu_h) ={}& |\Sigma_h^{-1}| \exp\left(-\sum_{i=1}^H \left\{ e_i'\Sigma_h^{-1}[h_{t+1}-\mu_h] + e_i'\Sigma_h^{-1}\Phi_h[h_t-\mu_h] \right\}\right) \\
&\times \prod_{i=1}^H \left(e_i'\Sigma_h^{-1}[h_{t+1}-\mu_h]\right)^{\nu_{h,i}-1} \sum_{z_{i,t}=0}^\infty \frac{1}{\Gamma(\nu_{h,i}+z_{i,t})}\frac{1}{z_{i,t}!} \left[\left(e_i'\Sigma_h^{-1}[h_{t+1}-\mu_h]\right)\left(e_i'\Sigma_h^{-1}\Phi_h[h_t-\mu_h]\right)\right]^{z_{i,t}}
\end{aligned}
\]
Using the definition of the modified Bessel function of the first kind,16 the p.d.f. can be expressed as
\[
\begin{aligned}
p(h_{t+1}|\nu_h,\Phi_h h_t,\Sigma_h,\mu_h) ={}& |\Sigma_h^{-1}| \exp\left(-\sum_{i=1}^H \left\{ e_i'\Sigma_h^{-1}[h_{t+1}-\mu_h] + e_i'\Sigma_h^{-1}\Phi_h[h_t-\mu_h] \right\}\right) \\
&\times \prod_{i=1}^H \left(e_i'\Sigma_h^{-1}[h_{t+1}-\mu_h]\right)^{\frac{\nu_{h,i}-1}{2}} \left(e_i'\Sigma_h^{-1}\Phi_h[h_t-\mu_h]\right)^{-\frac{\nu_{h,i}-1}{2}} \\
&\qquad\times I_{\nu_{h,i}-1}\left(2\sqrt{\left(e_i'\Sigma_h^{-1}[h_{t+1}-\mu_h]\right)\left(e_i'\Sigma_h^{-1}\Phi_h[h_t-\mu_h]\right)}\right).
\end{aligned}
\]

16 This is defined as \(I_\lambda(x) = \left(\frac{x}{2}\right)^\lambda \sum_{z=0}^\infty \frac{1}{\Gamma(\lambda+z+1)\,z!}\left(\frac{x^2}{4}\right)^z\); see Abramowitz and Stegun (1964).
The Laplace transform can be derived from the law of iterated expectations
\[
\begin{aligned}
E[\exp(u'h_{t+1})] &= E_z\left(E_{h|z}[\exp(u'h_{t+1})]\right) = E_z\left(\exp(u'\mu_h)\prod_{i=1}^H \left(\frac{1}{1-e_i'\Sigma_h'u}\right)^{\nu_{h,i}+z_i}\right) \\
&= \exp(u'\mu_h)\prod_{i=1}^H \left(\frac{1}{1-e_i'\Sigma_h'u}\right)^{\nu_{h,i}} E_z\left(\prod_{i=1}^H \left(\frac{1}{1-e_i'\Sigma_h'u}\right)^{z_i}\right) \\
&= \exp(u'\mu_h)\prod_{i=1}^H \left(\frac{1}{1-e_i'\Sigma_h'u}\right)^{\nu_{h,i}} \prod_{i=1}^H \exp\left(\frac{\left(e_i'\Sigma_h^{-1}\Phi_h[h_t-\mu_h]\right) e_i'\Sigma_h'u}{1-e_i'\Sigma_h'u}\right) \\
&= \exp\left(u'\mu_h + \sum_{i=1}^H \frac{e_i'\Sigma_h'u}{1-e_i'\Sigma_h'u}\, e_i'\Sigma_h^{-1}\Phi_h(h_t-\mu_h) - \sum_{i=1}^H \nu_{h,i}\log\left(1-e_i'\Sigma_h'u\right)\right)
\end{aligned}
\]
where \(e_i'\Sigma_h'u\) denotes the \(i\)-th element of the \(H \times 1\) vector \(\Sigma_h'u\). The Laplace transform exists only if \(e_i'\Sigma_h'u < 1\) for \(i = 1, \ldots, H\).
Appendix A.3 Mixture of Gaussian and mult. NCG distributions
From standard results in statistics, the multivariate (\(G \times 1\)) Gaussian r.v. \(g_{t+1} \sim N(\mu_g, \Sigma_g\Sigma_g')\) has Laplace transform \(E[\exp(u'g_{t+1})] = \exp\left(\mu_g'u + \frac{1}{2}u'\Sigma_g\Sigma_g'u\right)\) for any real \(G \times 1\) vector \(u\). Consider a \((G+H) \times 1\) vector \(x_{t+1} = (h_{t+1}', g_{t+1}')'\), where \(h_{t+1}\) is an \(H \times 1\) vector having a multivariate NCG distribution \(p(h_{t+1}|\nu_h,\Phi_h h_t,\Sigma_h,\mu_h)\) and \(g_{t+1}\) is a \(G \times 1\) vector of conditionally Gaussian r.v.'s \(g_{t+1} \sim N\left(\mu_g + \Sigma_{gh}h_{t+1},\, \Sigma_g\Sigma_g'\right)\). Let \(u = (u_h', u_g')'\), where \(u_h\) and \(u_g\) are \(H \times 1\) and \(G \times 1\) vectors, respectively. Using the law of iterated expectations, the Laplace transform is
\[
\begin{aligned}
E[\exp(u'x_{t+1})] &= E\left[\exp(u_g'g_{t+1})\exp(u_h'h_{t+1})\right] = E_h\left[E_{g|h}\left[\exp(u_g'g_{t+1})\right]\exp(u_h'h_{t+1})\right] \\
&= E_h\left[\exp\left((\mu_g + \Sigma_{gh}h_{t+1})'u_g + \tfrac{1}{2}u_g'\Sigma_g\Sigma_g'u_g\right)\exp(u_h'h_{t+1})\right] \\
&= \exp\left(u_g'\mu_g + \tfrac{1}{2}u_g'\Sigma_g\Sigma_g'u_g\right)E_h\left[\exp\left(\left[u_g'\Sigma_{gh} + u_h'\right]h_{t+1}\right)\right] \\
&= \exp\left(u_g'\mu_g + \tfrac{1}{2}u_g'\Sigma_g\Sigma_g'u_g + u_{gh}'\mu_h - \sum_{i=1}^H \nu_{h,i}\log\left(1-e_i'\Sigma_h'u_{gh}\right) + \sum_{i=1}^H \frac{e_i'\Sigma_h'u_{gh}}{1-e_i'\Sigma_h'u_{gh}}\, e_i'\Sigma_h^{-1}\Phi_h(h_t-\mu_h)\right)
\end{aligned}
\]
where \(u_{gh} = \Sigma_{gh}'u_g + u_h\) is an \(H \times 1\) vector. The Laplace transform exists only if \(e_i'\Sigma_h'u_{gh} < 1\) for \(i = 1, \ldots, H\). This is the key expression for solving for closed-form zero-coupon bond prices.
Appendix B Stochastic discount factor
We define the stochastic discount factor as
\[
M_{t+1} = \exp(-r_t)\,\frac{p(g_{t+1}|I_t,h_{t+1},z_{t+1};\theta,Q)\; p(h_{t+1}|I_t,z_{t+1};\theta,Q)\; p(z_{t+1}|I_t;\theta,Q)}{p(g_{t+1}|I_t,h_{t+1},z_{t+1};\theta,P)\; p(h_{t+1}|I_t,z_{t+1};\theta,P)\; p(z_{t+1}|I_t;\theta,P)}
\]
where the distributions are conditionally Gaussian, conditionally gamma, and Poisson. This is the exact (non-linear) SDF with no approximations, which we use during estimation. For intuition, consider breaking the log stochastic discount factor \(m_{t+1}\) into three terms, one for each of the shocks that the economic agent faces:
\[
m_{t+1} = -r_t + m_{g,t+1} + m_{h,t+1} + m_{z,t+1}
\]
where \(m_{i,t+1}\) is the compensation for risk \(i\). Let \(\lambda_g = \mu_g - \mu_g^Q\), \(\lambda_h = \nu_h - \nu_h^Q\), \(\Lambda_g = \Phi_g - \Phi_g^Q\), \(\Lambda_h = \Phi_h - \Phi_h^Q\), and \(\Lambda_{gh} = \Phi_{gh} - \Phi_{gh}^Q\).
Appendix B.1 Gaussian risks
Starting with the Gaussian portion, we find
\[
m_{g,t+1} = -\tfrac{1}{2}\lambda_{gt}'\lambda_{gt} - \lambda_{gt}'\tilde{\varepsilon}_{g,t+1}
\]
where \(\tilde{\varepsilon}_{g,t+1} = \Sigma_{g,t}^{-1}\varepsilon_{g,t+1}\) is a standard, zero-mean Gaussian shock. The price of Gaussian risk is
\[
\lambda_{gt} = \Sigma_{g,t}^{-1}\left(\lambda_g + \Lambda_g g_t + \Lambda_{gh}h_t\right) - \Sigma_{gh}\left[\Sigma_h\lambda_h + \Lambda_h(h_t-\mu_h)\right]
\]
This is a clear generalization of the expression for Gaussian ATSMs. The key difference is the time-varying quantity of risk \(\Sigma_{g,t}\).
Appendix B.2 Gamma risks
Recall from the definition of the non-Gaussian process that \(w_{t+1} = \Sigma_h^{-1}(h_{t+1} - \mu_h)\). We will write risk compensation in terms of \(w_{t+1}\):
\[
\begin{aligned}
m_{h,t+1} ={}& \sum_{i=1}^H \left[ -\log\Gamma\left(\nu_{h,i}^Q + z_{i,t+1}\right) + \left(\nu_{h,i}^Q + z_{i,t+1} - 1\right)\log\left(e_i'\Sigma_h^{-1}(h_{t+1}-\mu_h)\right) - e_i'\Sigma_h^{-1}(h_{t+1}-\mu_h) \right] \\
&+ \sum_{i=1}^H \left[ \log\Gamma(\nu_{h,i} + z_{i,t+1}) - (\nu_{h,i} + z_{i,t+1} - 1)\log\left(e_i'\Sigma_h^{-1}(h_{t+1}-\mu_h)\right) + e_i'\Sigma_h^{-1}(h_{t+1}-\mu_h) \right] \\
={}& \sum_{i=1}^H \left[ \log\left(\frac{\Gamma(\nu_{h,i}+z_{i,t+1})}{\Gamma\left(\nu_{h,i}^Q+z_{i,t+1}\right)}\right) - \left(\nu_{h,i}-\nu_{h,i}^Q\right)\log\left(e_i'\Sigma_h^{-1}(h_{t+1}-\mu_h)\right) \right] \\
\approx{}& \sum_{i=1}^H \left[ \log\left([\nu_{h,i}+z_{i,t+1}]^{\nu_{h,i}-\nu_{h,i}^Q}\right) - \lambda_{h,i}\log(w_{i,t+1}) \right] \\
={}& \sum_{i=1}^H -\lambda_{h,i}\left[\log(w_{i,t+1}) - \log(\nu_{h,i}+z_{i,t+1})\right] = \sum_{i=1}^H -\lambda_{h,i}\left[\log\left(1 + \frac{w_{i,t+1}-\nu_{h,i}-z_{i,t+1}}{\nu_{h,i}+z_{i,t+1}}\right)\right] \\
\approx{}& \sum_{i=1}^H -\frac{\lambda_{h,i}}{\sqrt{\nu_{h,i}+z_{i,t+1}}}\cdot\frac{w_{i,t+1}-\nu_{h,i}-z_{i,t+1}}{\sqrt{\nu_{h,i}+z_{i,t+1}}}
\end{aligned}
\]
This implies that the compensation for gamma risks is approximately17
\[
m_{h,t+1} \approx -\lambda_{wt}'\varepsilon_{w,t+1}
\]
where \(\varepsilon_{w,t+1,i} = \frac{w_{i,t+1}-\nu_{h,i}-z_{i,t+1}}{\sqrt{\nu_{h,i}+z_{i,t+1}}}\) is a gamma r.v. standardized to have mean zero and variance one. The market price of risk is \(\lambda_{wt,i} = \frac{\lambda_{h,i}}{\sqrt{\nu_{h,i}+z_{i,t+1}}}\). We note that \(V(w_{it}|z_t) = \nu_{h,i} + z_{i,t}\).

17 Our derivation uses the approximation \(\frac{\Gamma(a+x)}{\Gamma(b+x)} \sim x^{a-b}\left(1+O\left(\frac{1}{x}\right)\right)\) for large \(x\). We also use the fact that \(\log(1+x) \approx x\) for small \(x\).
Appendix B.3 Poisson risks
Consider the non-Gaussian part due to the Poisson distribution:
\[
\begin{aligned}
m_{z,t+1} ={}& \sum_{i=1}^H \left[ z_{i,t+1}\log\left(e_i'\Sigma_h^{-1}\Phi_h^Q(h_t-\mu_h)\right) - \log(z_{i,t+1}!) - e_i'\Sigma_h^{-1}\Phi_h^Q(h_t-\mu_h) \right.\\
&\qquad\left. - z_{i,t+1}\log\left(e_i'\Sigma_h^{-1}\Phi_h(h_t-\mu_h)\right) + \log(z_{i,t+1}!) + e_i'\Sigma_h^{-1}\Phi_h(h_t-\mu_h) \right] \\
={}& \sum_{i=1}^H \left[ z_{i,t+1}\log\left(e_i'\Sigma_h^{-1}\Phi_h^Q(h_t-\mu_h)\right) - z_{i,t+1}\log\left(e_i'\Sigma_h^{-1}\Phi_h(h_t-\mu_h)\right) + e_i'\Sigma_h^{-1}\left(\Phi_h-\Phi_h^Q\right)(h_t-\mu_h) \right] \\
={}& \sum_{i=1}^H \left[ z_{i,t+1}\log\left(1 - \frac{e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t}{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}\right) + e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t \right] \\
\approx{}& \sum_{i=1}^H \left[ -z_{i,t+1}\frac{e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t}{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t} + e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t \right] \\
={}& \sum_{i=1}^H -e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t\left(\frac{z_{i,t+1} - e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}\right) = \sum_{i=1}^H -\frac{e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t}{\sqrt{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}}\,\varepsilon_{z,t+1,i}
\end{aligned}
\]
The log stochastic discount factor is
\[
m_{z,t+1} \approx -\lambda_{zt}'\varepsilon_{z,t+1}
\]
where \(\varepsilon_{z,t+1,i} = \frac{z_{i,t+1} - e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}{\sqrt{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}}\) is a Poisson r.v. standardized to have mean 0 and variance 1, and \(\lambda_{zt,i} = \frac{e_i'\Sigma_h^{-1}\Lambda_h\Sigma_h w_t}{\sqrt{e_i'\Sigma_h^{-1}\Phi_h\Sigma_h w_t}}\).
Appendix C Bond pricing
Appendix C.1 Bond pricing recursions
Bond prices can be solved by induction. Guess that bond prices are \(P_t^n = \exp\left(a_n + b_{n,h}'h_t + b_{n,g}'g_t\right)\) for some coefficients \(a_n\), \(b_{n,h}\), and \(b_{n,g}\). At maturity \(n = 1\), when the payoff is \(P_{t+1}^0 = 1\), we find
\[
P_t^1 = E_t^Q\left[\exp(-r_t)P_{t+1}^0\right] = \exp\left(-\delta_0 - \delta_{1,h}'h_t - \delta_{1,g}'g_t\right)
\]
such that \(a_1 = -\delta_0\), \(b_{1,g} = -\delta_{1,g}\), and \(b_{1,h} = -\delta_{1,h}\). Next, consider an \(n\)-period bond whose price in the next period is \(P_{t+1}^{n-1}\). We find
\[
\begin{aligned}
P_t^n &= E_t^Q\left[\exp(-r_t)P_{t+1}^{n-1}\right] = E_t^Q\left[\exp\left(-\delta_0 - \delta_{1,h}'h_t - \delta_{1,g}'g_t\right)\exp\left(a_{n-1} + b_{n-1,h}'h_{t+1} + b_{n-1,g}'g_{t+1}\right)\right] \\
&= \exp\left(-\delta_0 - \delta_{1,h}'h_t - \delta_{1,g}'g_t + a_{n-1}\right)E_t^Q\left[\exp\left(b_{n-1,h}'h_{t+1} + b_{n-1,g}'g_{t+1}\right)\right]
\end{aligned}
\]
where the expectation is taken with respect to the distribution of the random vector \(x_{t+1} = (h_{t+1}', g_{t+1}')'\) under \(Q\), such that
\[
\begin{aligned}
h_{t+1} &\overset{Q}{\sim} \text{Mult-NCG}\left(\nu_h^Q,\, \Phi_h^Q h_t,\, \Sigma_h,\, \mu_h\right) \\
g_{t+1} &\overset{Q}{\sim} N\left(\mu_g^Q + \Phi_g^Q g_t + \Phi_{gh}^Q h_t + \Sigma_{gh}\left[h_{t+1} - \left((I_H-\Phi_h^Q)\mu_h + \Sigma_h\nu_h^Q + \Phi_h^Q h_t\right)\right],\; \Sigma_{g,t}\Sigma_{g,t}'\right)
\end{aligned}
\]
This expectation has the same form as the Laplace transform provided in Appendix A. Using \(e_i\) to denote an \(H \times 1\) unit vector, we find
\[
\begin{aligned}
P_t^n ={}& \exp\left( -\delta_0 - \delta_{1,g}'g_t - \delta_{1,h}'h_t + a_{n-1} + \tfrac{1}{2}b_{n-1,g}'\Sigma_{g,t}\Sigma_{g,t}'b_{n-1,g} \right.\\
&+ \left[\mu_g^Q + \Phi_g^Q g_t + \Phi_{gh}^Q h_t - \Sigma_{gh}\left((I_H-\Phi_h^Q)\mu_h + \Sigma_h\nu_h^Q + \Phi_h^Q h_t\right)\right]'b_{n-1,g} \\
&\left. + \left[\Sigma_{gh}'b_{n-1,g} + b_{n-1,h}\right]'\mu_h + \sum_{i=1}^H \frac{e_i'\Sigma_h'b_{n-1,gh}}{1-e_i'\Sigma_h'b_{n-1,gh}}\, e_i'\Sigma_h^{-1}\Phi_h^Q[h_t-\mu_h] - \sum_{i=1}^H \nu_{h,i}^Q\log\left(1-e_i'\Sigma_h'b_{n-1,gh}\right) \right) \\
={}& \exp\left( -\delta_0 + a_{n-1} + \mu_g^{Q\prime}b_{n-1,g} + \mu_h'\left[b_{n-1,h} + \Phi_h^{Q\prime}\Sigma_{gh}'b_{n-1,g}\right] - \nu_h^{Q\prime}\Sigma_h'\Sigma_{gh}'b_{n-1,g} \right.\\
&+ \tfrac{1}{2}b_{n-1,g}'\Sigma_{0,g}\Sigma_{0,g}'b_{n-1,g} - \sum_{i=1}^H \nu_{h,i}^Q\log\left(1-e_i'\Sigma_h'b_{n-1,gh}\right) - \sum_{i=1}^H \frac{e_i'\Sigma_h'b_{n-1,gh}}{1-e_i'\Sigma_h'b_{n-1,gh}}\, e_i'\Sigma_h^{-1}\Phi_h^Q\mu_h \\
&+ \left[b_{n-1,g}'\Phi_g^Q - \delta_{1,g}'\right]g_t \\
&+ \left[\sum_{i=1}^H \frac{e_i'\Sigma_h'b_{n-1,gh}}{1-e_i'\Sigma_h'b_{n-1,gh}}\, e_i'\Sigma_h^{-1}\Phi_h^Q + b_{n-1,g}'\left(\Phi_{gh}^Q - \Sigma_{gh}\Phi_h^Q\right) - \delta_{1,h}' \right.\\
&\left.\left.\qquad + \tfrac{1}{2}\left(I_H \otimes b_{n-1,g}\right)'\bar{\Sigma}_g\bar{\Sigma}_g'\left(\iota_H \otimes b_{n-1,g}\right)\right]h_t \right)
\end{aligned}
\]
where \(\bar{\Sigma}_g\bar{\Sigma}_g'\) is a \(GH \times GH\) block-diagonal matrix with diagonal blocks \(\Sigma_{i,g}\Sigma_{i,g}'\) for \(i = 1, \ldots, H\). The expression \(b_{n-1,gh} = \Sigma_{gh}'b_{n-1,g} + b_{n-1,h}\) is an \(H \times 1\) vector. The Laplace transform exists only if \(e_i'\Sigma_h'b_{n-1,gh} < 1\) for \(i = 1, \ldots, H\).
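For intuition, the recursion is easy to code in the purely Gaussian special case (\(H = 0\)), where the expression above reduces to \(a_n = -\delta_0 + a_{n-1} + \mu_g^{Q\prime}b_{n-1,g} + \frac{1}{2}b_{n-1,g}'\Sigma_{0,g}\Sigma_{0,g}'b_{n-1,g}\) and \(b_{n,g}' = b_{n-1,g}'\Phi_g^Q - \delta_{1,g}'\). The sketch below uses hypothetical parameter values and is not the full non-Gaussian recursion:

```python
import numpy as np

def bond_loadings(N, delta0, delta1_g, mu_g_Q, Phi_g_Q, Sigma0_g):
    """Recursion for P_t^n = exp(a_n + b_{n,g}' g_t) in the Gaussian (H = 0) case."""
    G = len(delta1_g)
    a = np.zeros(N + 1)
    b = np.zeros((N + 1, G))
    a[1] = -delta0                        # a_1 = -delta_0
    b[1] = -delta1_g                      # b_{1,g} = -delta_{1,g}
    S = Sigma0_g @ Sigma0_g.T
    for n in range(2, N + 1):
        a[n] = -delta0 + a[n - 1] + mu_g_Q @ b[n - 1] + 0.5 * b[n - 1] @ S @ b[n - 1]
        b[n] = b[n - 1] @ Phi_g_Q - delta1_g
    return a, b

# hypothetical one-factor example
a, b = bond_loadings(60, 0.004, np.array([1.0]),
                     np.array([0.0]), np.array([[0.98]]), np.array([[0.01]]))
```

Yields then follow as \(y_t^n = -(a_n + b_{n,g}'g_t)/n\).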
Appendix D Log-likelihood function
The log-likelihood for the general affine model is given by
\[
\begin{aligned}
\ell(\theta) ={}& \text{CONST} - (T-1)\log|\det(B_1)| - \frac{T-1}{2}\log|\Omega| - \frac{1}{2}\sum_{t=2}^T \text{tr}\left(\Omega^{-1}\eta_t\eta_t'\right) - \frac{1}{2}\sum_{t=2}^T \log\left|\Sigma_{g,t-1}\Sigma_{g,t-1}'\right| \\
&- \frac{1}{2}\sum_{t=2}^T \text{tr}\left(\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\varepsilon_{gt}\varepsilon_{gt}'\right) - (T-1)\log|\Sigma_h| - \sum_{t=2}^T\sum_{i=1}^H e_i'\Sigma_h^{-1}(h_t-\mu_h) - \sum_{t=2}^T\sum_{i=1}^H e_i'\Sigma_h^{-1}\Phi_h(h_{t-1}-\mu_h) \\
&+ \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2}\log\left(e_i'\Sigma_h^{-1}[h_t-\mu_h]\right) - \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2}\log\left(e_i'\Sigma_h^{-1}\Phi_h[h_{t-1}-\mu_h]\right) \\
&+ \sum_{t=2}^T\sum_{i=1}^H \log I_{\nu_{h,i}-1}\left(2\sqrt{\left(e_i'\Sigma_h^{-1}[h_t-\mu_h]\right)\left(e_i'\Sigma_h^{-1}\Phi_h[h_{t-1}-\mu_h]\right)}\right)
\end{aligned}
\]
where \(I_\lambda(x)\) is the modified Bessel function of the first kind; see Abramowitz and Stegun (1964). We use \(e_i\) to denote the \(H \times 1\) unit vector.
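In practice the modified Bessel factor in this likelihood overflows for moderately large arguments. One standard remedy (a sketch using SciPy, not tied to the authors' code) is to evaluate \(\log I_\lambda(x)\) through the exponentially scaled function, since \(\log I_\lambda(x) = \log\left(e^{-x}I_\lambda(x)\right) + x\):

```python
import numpy as np
from scipy.special import ive, iv

def log_bessel_i(lam, x):
    """Numerically stable log I_lam(x) via the exponentially scaled ive."""
    return np.log(ive(lam, x)) + x

# agrees with the direct evaluation where iv does not overflow
print(log_bessel_i(2.5, 10.0), np.log(iv(2.5, 10.0)))
# direct iv(2.5, 800.0) overflows to inf; the scaled version stays finite
print(log_bessel_i(2.5, 800.0))
```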
Appendix E Proof of Propositions
Appendix E.1 Proof of Proposition 1
The necessary admissibility restrictions to keep the non-Gaussian factors positive are:

1. \(C_{hg} = 0\);
2. \(C_{hh}\) is restricted such that all elements of \(C_{hh}\Sigma_h\) are non-negative;
3. \(c_h\) is restricted such that all elements of \(c_h + C_{hh}\mu_h\) are non-negative;
4. \(C_{hh}\) and \(C_{gg}\) are full rank.

For some values of \(\theta\), these restrictions may allow elements of \(c_h\) and \(C_{hh}\) to be negative.
Under these restrictions, the process \(\bar{x}_{t+1} = (\bar{h}_{t+1}', \bar{g}_{t+1}')'\) is a member of the same family of distributions as \(x_{t+1} = (h_{t+1}', g_{t+1}')'\), only under a new parameter vector \(\bar{\theta}\). The proof of this proposition is immediate upon comparing the Laplace transform of \(\bar{x}_{t+1}\) to the Laplace transform of \(x_{t+1}\). The mapping between the new parameters \(\bar{\theta}\) and the original parameters \(\theta\) is given by
\[
\begin{aligned}
\bar{\mu}_h &= c_h + C_{hh}\mu_h \\
\bar{\Phi}_h &= C_{hh}\Phi_h C_{hh}^{-1} \\
\bar{\Sigma}_h &= C_{hh}\Sigma_h \\
\bar{\mu}_g &= c_g + C_{gg}\mu_g - C_{gg}\Phi_g C_{gg}^{-1}c_g + C_{gh}\left([I_H-\Phi_h]\mu_h + \Sigma_h\nu_h\right) - \left(C_{gh}\Phi_h - C_{gg}\Phi_g C_{gg}^{-1}C_{gh} + C_{gg}\Phi_{gh}\right)C_{hh}^{-1}c_h \\
\bar{\Phi}_g &= C_{gg}\Phi_g C_{gg}^{-1} \\
\bar{\Phi}_{gh} &= \left(C_{gh}\Phi_h - C_{gg}\Phi_g C_{gg}^{-1}C_{gh} + C_{gg}\Phi_{gh}\right)C_{hh}^{-1} \\
\bar{\Sigma}_{gh} &= \left(C_{gh} + C_{gg}\Sigma_{gh}\right)C_{hh}^{-1} \\
\bar{\Sigma}_{0,g}\bar{\Sigma}_{0,g}' &= C_{gg}\Sigma_{0,g}\Sigma_{0,g}'C_{gg}' - \sum_{i=1}^H C_{gg}\Sigma_{i,g}\Sigma_{i,g}'C_{gg}'\; e_i'C_{hh}^{-1}c_h \\
\bar{\Sigma}_{i,g}\bar{\Sigma}_{i,g}' &= \sum_{j=1}^H C_{gg}\Sigma_{j,g}\Sigma_{j,g}'C_{gg}'\; e_j'C_{hh}^{-1}e_i
\end{aligned}
\]
Appendix E.2 Proof of Proposition 2
The proof that maximizing a concentrated log-likelihood will result in maximization of the original log-likelihood can be found in Property 7.4 of Gourieroux and Monfort (1995). To prove Proposition 2, we only need to show that the first-order conditions for \(\theta_c = (\mu_g, \Phi_g, \Phi_{gh}, \Omega)\) can be solved analytically as a function of \(\theta_m\) using GLS. Let \(\beta' = (\mu_g, \Phi_g, \Phi_{gh})\), \(\tilde{g}_t = g_t - \Sigma_{gh}\left[h_t - \left((I-\Phi_h)\mu_h + \Sigma_h\nu_h + \Phi_h h_{t-1}\right)\right]\), and \(G_{t-1} = (\iota, g_{t-1}, h_{t-1})\). We use \(D_k\) to denote the duplication matrix and \(D_k^L\) to denote the duplication matrix for lower triangular (i.e., not necessarily symmetric) matrices.

The derivatives for the parameters in \(\theta_c\) are
\[
\begin{aligned}
\frac{\partial\ell}{\partial\text{vec}(\beta)'} &= \text{vec}\left(\sum_{t=2}^T G_{t-1}\tilde{g}_t'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\right) - \text{vec}\left(\sum_{t=2}^T G_{t-1}G_{t-1}'\beta\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\right) \\
\frac{\partial\ell}{\partial\text{vech}(\Omega)'} &= -\frac{T-1}{2}\text{vec}\left(\Omega^{-1}\right)'D_G + \frac{1}{2}\sum_{t=2}^T \text{vec}\left(\Omega^{-1}\eta_t\eta_t'\Omega^{-1}\right)'D_G
\end{aligned}
\]
These two first-order conditions are the same as in generalized least squares and can be solved to obtain \(\hat{\theta}_c(\theta_m) = \left(\hat{\mu}_g(\theta_m), \hat{\Phi}_g(\theta_m), \hat{\Phi}_{gh}(\theta_m), \hat{\Omega}(\theta_m)\right)\), which concludes the proof of Proposition 2.
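The concentration step amounts to a GLS regression of \(\tilde{g}_t\) on \(G_{t-1}\) with time-varying weights. A self-contained sketch follows (simulated data; the function and variable names are ours, and the weights are held constant across \(t\) so the GLS solution can be checked against OLS):

```python
import numpy as np

def concentrate_beta(g_tilde, G_lag, weights):
    """Solve the GLS first-order condition
    sum_t G_{t-1} g~_t' W_t^-1 = sum_t G_{t-1} G_{t-1}' beta W_t^-1
    for beta, by stacking it as a linear system in vec(beta)."""
    T, G = g_tilde.shape
    k = G_lag.shape[1]
    A = np.zeros((k * G, k * G))
    rhs = np.zeros(k * G)
    for t in range(T):
        Winv = np.linalg.inv(weights[t])
        A += np.kron(Winv, np.outer(G_lag[t], G_lag[t]))
        rhs += np.kron(Winv @ g_tilde[t], G_lag[t])
    return np.linalg.solve(A, rhs).reshape(G, k).T   # undo column-major vec

rng = np.random.default_rng(2)
T, k, G = 5000, 3, 2
G_lag = rng.normal(size=(T, k))
beta_true = rng.normal(size=(k, G))
g_tilde = G_lag @ beta_true + 0.1 * rng.normal(size=(T, G))
W = np.broadcast_to(0.01 * np.eye(G), (T, G, G))      # constant weights here
beta_hat = concentrate_beta(g_tilde, G_lag, W)
```

With constant weights the GLS estimate collapses to equation-by-equation OLS, which is a convenient correctness check; in the model the weights \(\Sigma_{g,t-1}\Sigma_{g,t-1}'\) vary with \(h_{t-1}\).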
Appendix F Analytical Derivatives
In this appendix, we provide the analytical derivatives.
Appendix F.1 Preliminary lemma
The following lemma shows that the gradients of the log-likelihood and the concentrated log-likelihood are related. This enables us to derive the gradients of the concentrated log-likelihood from the original log-likelihood, which makes the derivation easier.
Lemma 1 The derivative of the concentrated log-likelihood function \(\ell^c(\theta_m) \equiv \ell\left(\hat{\theta}_c(\theta_m), \theta_m\right)\) with respect to \(\theta_m\) can be computed as the partial derivative of the log-likelihood function \(\ell\left(\hat{\theta}_c, \theta_m\right)\) with respect to \(\theta_m\), where \(\hat{\theta}_c(\theta_m) = \arg\max_{\theta_c} \ell(\theta_c, \theta_m)\):
\[
\frac{d\ell^c(\theta_m)}{d\theta_m'} = \frac{\partial\ell\left(\hat{\theta}_c, \theta_m\right)}{\partial\theta_m'}
\]
Proof:
\[
\frac{d\ell^c(\theta_m)}{d\theta_m'} \equiv \frac{d\ell\left(\hat{\theta}_c(\theta_m), \theta_m\right)}{d\theta_m'} = \frac{\partial\ell\left(\hat{\theta}_c, \theta_m\right)}{\partial\theta_m'} + \frac{\partial\ell\left(\hat{\theta}_c, \theta_m\right)}{\partial\theta_c'}\frac{d\hat{\theta}_c(\theta_m)}{d\theta_m'} = \frac{\partial\ell\left(\hat{\theta}_c, \theta_m\right)}{\partial\theta_m'}
\]
where \(\frac{\partial\ell(\hat{\theta}_c,\theta_m)}{\partial\theta_c'} = 0\) by the definition of \(\hat{\theta}_c\).
Appendix F.2 Proof of Proposition 3
Using Lemma 1, we find
\[
\frac{d\ell\left(\hat{\theta}_c(\theta_m), \theta_m, A(\theta_m), B(\theta_m)\right)}{d\theta_m'} = \frac{d\ell\left(\hat{\theta}_c, \theta_m, A(\theta_m), B(\theta_m)\right)}{d\theta_m'}.
\]
Applying the chain rule,
\[
\frac{d\ell\left(\hat{\theta}_c, \theta_m, A(\theta_m), B(\theta_m)\right)}{d\theta_m'} = \frac{\partial\ell\left(\hat{\theta}_c, \theta_m, A, B\right)}{\partial\theta_m'} + \frac{\partial\ell\left(\hat{\theta}_c, \theta_m, A, B\right)}{\partial A'}\frac{\partial A(\theta_m)}{\partial\theta_m'} + \frac{\partial\ell\left(\hat{\theta}_c, \theta_m, A, B\right)}{\partial\text{vec}(B')'}\frac{\partial\text{vec}\left(B(\theta_m)'\right)}{\partial\theta_m'},
\]
which concludes the proof.
Appendix F.3 Gradients
Given Proposition 3, we can now provide the analytical gradient. Note that \(\hat{\theta}_c = (\hat{\mu}_g, \hat{\Phi}_g, \hat{\Phi}_{gh}, \hat{\Omega})\) is obtained from \(\hat{\theta}_c = \arg\max_{\theta_c} \ell(\theta_c, \theta_m)\) as detailed in Proposition 2. For convenience, let \(\tilde{h}_t = \Sigma_h^{-1}(h_t - \mu_h)\), \(\hat{h}_{t-1} = \Sigma_h^{-1}\Phi_h(h_{t-1} - \mu_h)\), and \(\bar{h}_{it} = 2\sqrt{\left(e_i'\Sigma_h^{-1}[h_t-\mu_h]\right)\left(e_i'\Sigma_h^{-1}\Phi_h[h_{t-1}-\mu_h]\right)}\).
\[
\begin{aligned}
\frac{\partial\ell}{\partial\nu_{h,i}} ={}& -\sum_{t=2}^T \varepsilon_{gt}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\Sigma_{gh}\Sigma_h e_i + \frac{1}{2}\sum_{t=2}^T \log\left(\frac{e_i'\tilde{h}_t}{e_i'\hat{h}_{t-1}}\right) + \sum_{t=2}^T \frac{1}{I_{\nu_{h,i}-1}(\bar{h}_{it})}\frac{\partial I_{\nu_{h,i}-1}(\bar{h}_{it})}{\partial\nu_{h,i}} \\
\frac{\partial\ell}{\partial\text{vec}(\Phi_h)'} ={}& -\sum_{t=2}^T \text{vec}\left(\Sigma_{gh}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\varepsilon_{gt}h_{t-1}'\right)' - \sum_{t=2}^T\sum_{i=1}^H \text{vec}\left((h_{t-1}-\mu_h)e_i'\Sigma_h^{-1}\right)' \\
&- \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2e_i'\hat{h}_{t-1}}\text{vec}\left((h_{t-1}-\mu_h)e_i'\Sigma_h^{-1}\right)' \\
&+ \sum_{t=2}^T\sum_{i=1}^H \left[\frac{(\nu_{h,i}-1)}{\bar{h}_{it}} + \frac{I_{\nu_{h,i}}(\bar{h}_{it})}{I_{\nu_{h,i}-1}(\bar{h}_{it})}\right]\frac{2e_i'\tilde{h}_t}{\bar{h}_{it}}\text{vec}\left((h_{t-1}-\mu_h)e_i'\Sigma_h^{-1}\right)'
\end{aligned}
\]
where we have used the fact that \(\frac{\partial I_\lambda(x)}{\partial x} = \frac{\lambda}{x}I_\lambda(x) + I_{\lambda+1}(x)\); see Abramowitz and Stegun (1964). The derivative \(\frac{\partial I_\lambda(x)}{\partial\lambda}\) is a complicated expression that is easier to compute numerically. The derivatives for the parameters that only enter the bond loadings are calculated in two steps via the chain rule. First, we take derivatives of \(\ell\) with respect to the loadings \(A_1\), \(A_2\), \(B_1\) and \(B_2\). Then, we take derivatives of the bond loadings with respect to the model's parameters inside the bond loadings.
\[
\begin{aligned}
\frac{\partial\ell}{\partial\delta_0} &= \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\delta_0} \\
\frac{\partial\ell}{\partial\delta_{1,g}'} &= \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\delta_{1,g}'} + \frac{\partial\ell}{\partial\text{vec}(B')'}\frac{\partial\text{vec}(B')}{\partial\delta_{1,g}'} \\
\frac{\partial\ell}{\partial\delta_{1,h}'} &= \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\delta_{1,h}'} + \frac{\partial\ell}{\partial\text{vec}(B')'}\frac{\partial\text{vec}(B')}{\partial\delta_{1,h}'} \\
\frac{\partial\ell}{\partial\mu_g^{Q\prime}} &= \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\mu_g^{Q\prime}} \\
\frac{\partial\ell}{\partial\text{vec}(\Phi_g^Q)'} &= \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\text{vec}(\Phi_g^Q)'} + \frac{\partial\ell}{\partial\text{vec}(B')'}\frac{\partial\text{vec}(B')}{\partial\text{vec}(\Phi_g^Q)'} \\
\frac{\partial\ell}{\partial\text{vec}(\Phi_{gh}^Q)'} &= \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\text{vec}(\Phi_{gh}^Q)'} + \frac{\partial\ell}{\partial\text{vec}(B')'}\frac{\partial\text{vec}(B')}{\partial\text{vec}(\Phi_{gh}^Q)'} \\
\frac{\partial\ell}{\partial\nu_h^{Q\prime}} &= \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\nu_h^{Q\prime}} \\
\frac{\partial\ell}{\partial\text{vec}(\Phi_h^Q)'} &= \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\text{vec}(\Phi_h^Q)'} + \frac{\partial\ell}{\partial\text{vec}(B')'}\frac{\partial\text{vec}(B')}{\partial\text{vec}(\Phi_h^Q)'}
\end{aligned}
\]
The derivatives of the remaining parameters, which enter both the loadings and the \(P\) dynamics, are
\[
\begin{aligned}
\frac{d\ell}{d\text{vec}(\Sigma_h)'} &= \frac{\partial\ell}{\partial\text{vec}(\Sigma_h)'} + \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\text{vec}(\Sigma_h)'} + \frac{\partial\ell}{\partial\text{vec}(B')'}\frac{\partial\text{vec}(B')}{\partial\text{vec}(\Sigma_h)'} \\
\frac{d\ell}{d\text{vec}(\Sigma_{gh})'} &= \frac{\partial\ell}{\partial\text{vec}(\Sigma_{gh})'} + \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\text{vec}(\Sigma_{gh})'} + \frac{\partial\ell}{\partial\text{vec}(B')'}\frac{\partial\text{vec}(B')}{\partial\text{vec}(\Sigma_{gh})'} \\
\frac{d\ell}{d\text{vech}(\Sigma_{0,g})'} &= \frac{\partial\ell}{\partial\text{vech}(\Sigma_{0,g})'} + \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\text{vech}(\Sigma_{0,g})'} \\
\frac{d\ell}{d\text{vech}(\Sigma_{i,g})'} &= \frac{\partial\ell}{\partial\text{vech}(\Sigma_{i,g})'} + \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\text{vech}(\Sigma_{i,g})'} + \frac{\partial\ell}{\partial\text{vec}(B')'}\frac{\partial\text{vec}(B')}{\partial\text{vech}(\Sigma_{i,g})'} \\
\frac{d\ell}{d\mu_h'} &= \frac{\partial\ell}{\partial\mu_h'} + \frac{\partial\ell}{\partial A'}\frac{\partial A}{\partial\mu_h'}
\end{aligned}
\]
We need the following derivatives:
\[
\begin{aligned}
\frac{\partial\ell}{\partial\text{vec}(\Sigma_h)'} ={}& -\sum_{t=2}^T \text{vec}\left(\Sigma_{gh}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\varepsilon_{gt}\nu_h'\right)' - (T-1)\text{vec}\left(\Sigma_h^{-1}\right)' + \sum_{t=2}^T\sum_{i=1}^H \text{vec}\left(\tilde{h}_t e_i'\Sigma_h^{-1}\right)' \\
&+ \sum_{t=2}^T\sum_{i=1}^H \text{vec}\left(\hat{h}_{t-1}e_i'\Sigma_h^{-1}\right)' + \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2e_i'\tilde{h}_t}\text{vec}\left(\tilde{h}_t e_i'\Sigma_h^{-1}\right)' - \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2e_i'\hat{h}_{t-1}}\text{vec}\left(\hat{h}_{t-1}e_i'\Sigma_h^{-1}\right)' \\
&- \sum_{t=2}^T\sum_{i=1}^H \left[\frac{(\nu_{h,i}-1)}{\bar{h}_{it}} + \frac{I_{\nu_{h,i}}(\bar{h}_{it})}{I_{\nu_{h,i}-1}(\bar{h}_{it})}\right]\frac{2e_i'\tilde{h}_t}{\bar{h}_{it}}\text{vec}\left(\hat{h}_{t-1}e_i'\Sigma_h^{-1}\right)' \\
&- \sum_{t=2}^T\sum_{i=1}^H \left[\frac{(\nu_{h,i}-1)}{\bar{h}_{it}} + \frac{I_{\nu_{h,i}}(\bar{h}_{it})}{I_{\nu_{h,i}-1}(\bar{h}_{it})}\right]\frac{2e_i'\hat{h}_{t-1}}{\bar{h}_{it}}\text{vec}\left(\tilde{h}_t e_i'\Sigma_h^{-1}\right)' \\
\frac{\partial\ell}{\partial\text{vec}(\Sigma_{gh})'} ={}& \sum_{t=2}^T \text{vec}\left(\varepsilon_{ht}\varepsilon_{gt}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\right)' \\
\frac{\partial\ell}{\partial\text{vech}(\Sigma_{0,g})'} ={}& \sum_{t=2}^T \text{vec}\left(\left[\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\varepsilon_{gt}\varepsilon_{gt}' - I_G\right]\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\Sigma_{0,g}\right)'D_G^L \\
\frac{\partial\ell}{\partial\text{vech}(\Sigma_{i,g})'} ={}& \sum_{t=2}^T \text{vec}\left(\left[\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\varepsilon_{gt}\varepsilon_{gt}' - I_G\right]\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\Sigma_{i,g}h_{i,t-1}\right)'D_G^L \\
\frac{\partial\ell}{\partial\mu_h'} ={}& -\sum_{t=2}^T \varepsilon_{gt}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\Sigma_{gh}(I_H-\Phi_h) + \sum_{t=2}^T\sum_{i=1}^H e_i'\Sigma_h^{-1} + \sum_{t=2}^T\sum_{i=1}^H e_i'\Sigma_h^{-1}\Phi_h \\
&- \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2e_i'\tilde{h}_t}e_i'\Sigma_h^{-1} + \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2e_i'\hat{h}_{t-1}}e_i'\Sigma_h^{-1}\Phi_h \\
&- \sum_{t=2}^T\sum_{i=1}^H \left[\frac{(\nu_{h,i}-1)}{\bar{h}_{it}} + \frac{I_{\nu_{h,i}}(\bar{h}_{it})}{I_{\nu_{h,i}-1}(\bar{h}_{it})}\right]\frac{2}{\bar{h}_{it}}\left[e_i'\tilde{h}_t\, e_i'\Sigma_h^{-1}\Phi_h + e_i'\hat{h}_{t-1}\, e_i'\Sigma_h^{-1}\right]
\end{aligned}
\]
\[
\begin{aligned}
\frac{\partial\ell}{\partial A_1'} ={}& \sum_{t=2}^T \left[ -\eta_t'\Omega^{-1}B_2B_1^{-1} + \varepsilon_{gt}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\left[(I_G-\Phi_g)S_g - (\Phi_{gh}+\Sigma_{gh}-\Sigma_{gh}\Phi_h)S_h\right]B_1^{-1} \right.\\
&\left. + \frac{1}{2}\text{vec}\left(\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\left[I_G - \varepsilon_{gt}\varepsilon_{gt}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\right]\right)'\sum_{i=1}^H \text{vec}\left(\Sigma_{i,g}\Sigma_{i,g}'\right)S_{hi}B_1^{-1} \right] \\
&+ \sum_{t=2}^T\sum_{i=1}^H e_i'\Sigma_h^{-1}S_hB_1^{-1} - \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2e_i'\tilde{h}_t}e_i'\Sigma_h^{-1}S_hB_1^{-1} \\
&+ \sum_{t=2}^T\sum_{i=1}^H e_i'\Sigma_h^{-1}\Phi_hS_hB_1^{-1} + \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2e_i'\hat{h}_{t-1}}e_i'\Sigma_h^{-1}\Phi_hS_hB_1^{-1} \\
&- \sum_{t=2}^T\sum_{i=1}^H \left[\frac{(\nu_{h,i}-1)}{\bar{h}_{it}} + \frac{I_{\nu_{h,i}}(\bar{h}_{it})}{I_{\nu_{h,i}-1}(\bar{h}_{it})}\right]\frac{2e_i'\hat{h}_{t-1}}{\bar{h}_{it}}e_i'\Sigma_h^{-1}S_hB_1^{-1} \\
&- \sum_{t=2}^T\sum_{i=1}^H \left[\frac{(\nu_{h,i}-1)}{\bar{h}_{it}} + \frac{I_{\nu_{h,i}}(\bar{h}_{it})}{I_{\nu_{h,i}-1}(\bar{h}_{it})}\right]\frac{2e_i'\tilde{h}_t}{\bar{h}_{it}}e_i'\Sigma_h^{-1}\Phi_hS_hB_1^{-1} \\
\frac{\partial\ell}{\partial A_2'} ={}& \sum_{t=2}^T \eta_t'\Omega^{-1} \\
\frac{\partial\ell}{\partial\text{vec}(B_1')'} ={}& -(T-1)\text{vec}\left(B_1^{-1}\right)' + \sum_{t=2}^T \left[ -\text{vec}\left(x_t\eta_t'\Omega^{-1}B_2B_1^{-1}\right)' + \text{vec}\left(x_t\varepsilon_{gt}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}(S_g-\Sigma_{gh}S_h)B_1^{-1}\right)' \right.\\
&- \text{vec}\left(x_{t-1}\varepsilon_{gt}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\left(\Phi_gS_g + [\Phi_{gh}-\Sigma_{gh}\Phi_h]S_h\right)B_1^{-1}\right)' \\
&\left. + \frac{1}{2}\text{vec}\left(\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\left[I_G - \varepsilon_{gt}\varepsilon_{gt}'\left(\Sigma_{g,t-1}\Sigma_{g,t-1}'\right)^{-1}\right]\right)'\sum_{i=1}^H \text{vec}\left(\Sigma_{i,g}\Sigma_{i,g}'\right)\left[S_{hi}B_1^{-1}\otimes x_{t-1}'\right] \right] \\
&+ \sum_{t=2}^T\sum_{i=1}^H \text{vec}\left(x_te_i'\Sigma_h^{-1}S_hB_1^{-1}\right)' - \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2e_i'\tilde{h}_t}\text{vec}\left(x_te_i'\Sigma_h^{-1}S_hB_1^{-1}\right)' \\
&+ \sum_{t=2}^T\sum_{i=1}^H \text{vec}\left(x_{t-1}e_i'\Sigma_h^{-1}\Phi_hS_hB_1^{-1}\right)' + \sum_{t=2}^T\sum_{i=1}^H \frac{(\nu_{h,i}-1)}{2e_i'\hat{h}_{t-1}}\text{vec}\left(x_{t-1}e_i'\Sigma_h^{-1}\Phi_hS_hB_1^{-1}\right)' \\
&- \sum_{t=2}^T\sum_{i=1}^H \left[\frac{(\nu_{h,i}-1)}{\bar{h}_{it}} + \frac{I_{\nu_{h,i}}(\bar{h}_{it})}{I_{\nu_{h,i}-1}(\bar{h}_{it})}\right]\frac{2e_i'\hat{h}_{t-1}}{\bar{h}_{it}}\text{vec}\left(x_te_i'\Sigma_h^{-1}S_hB_1^{-1}\right)' \\
&- \sum_{t=2}^T\sum_{i=1}^H \left[\frac{(\nu_{h,i}-1)}{\bar{h}_{it}} + \frac{I_{\nu_{h,i}}(\bar{h}_{it})}{I_{\nu_{h,i}-1}(\bar{h}_{it})}\right]\frac{2e_i'\tilde{h}_t}{\bar{h}_{it}}\text{vec}\left(x_{t-1}e_i'\Sigma_h^{-1}\Phi_hS_hB_1^{-1}\right)' \\
\frac{\partial\ell}{\partial\text{vec}(B_2')'} ={}& \sum_{t=2}^T \text{vec}\left(x_t\eta_t'\Omega^{-1}\right)'
\end{aligned}
\]
The derivatives of the bond loadings \(A\) and \(B\) with respect to each of the parameters can be computed recursively as a function of maturity, along with the loadings \(a_n\), \(b_{n,g}\), and \(b_{n,h}\). The derivatives of the Gaussian loadings \(B_g\) and the non-Gaussian loadings \(B_h\) have separate recursions. We use \(b_{n,g,\psi}\) to denote the derivative of the Gaussian loading \(b_{n,g}\) at maturity \(n\) with respect to a parameter \(\psi\). All recursions for the derivatives are written assuming that \(\psi\) is a full vector/matrix of parameters with no restrictions. In practice, if the matrix has fewer parameters than entries, then the user will have to multiply the respective recursion by a selection matrix. Let \(d_{n-1} = \text{diag}\left(\iota_H - \Sigma_h'b_{n-1,gh}\right)\) be a diagonal \(H \times H\) matrix, and let \(c_{n-1}' = \left(\nu_h^{Q\prime}d_{n-1}^{-1} - \mu_h'\Phi_h^{Q\prime}\Sigma_h^{-1\prime}d_{n-1}^{-2}\right)\Sigma_h'\) be a \(1 \times H\) vector. The recursions for the derivatives of \(A\) as a function of maturity are
\[
\begin{aligned}
a_{n,\mu_g^Q}' \;(1\times G) &= a_{n-1,\mu_g^Q}' + b_{n-1,g}' \\
a_{n,\mu_h}' \;(1\times H) &= a_{n-1,\mu_h}' + b_{n-1,h}' + b_{n-1,g}'\Sigma_{gh}\Phi_h^Q - b_{n-1,gh}'\Sigma_hd_{n-1}^{-1}\Sigma_h^{-1}\Phi_h^Q \\
a_{n,\nu_h^Q}' \;(1\times H) &= a_{n-1,\nu_h^Q}' - \log\left(\iota_H' - b_{n-1,gh}'\Sigma_h\right) - b_{n-1,g}'\Sigma_{gh}\Sigma_h \\
a_{n,\Sigma_{0,g}}' \;(1\times G(G+1)/2) &= a_{n-1,\Sigma_{0,g}}' + \text{vec}\left(b_{n-1,g}b_{n-1,g}'\Sigma_{0,g}\right)'D_G^L \\
a_{n,\delta_{1h}}' \;(1\times H) &= a_{n-1,\delta_{1h}}' + \mu_h'b_{n-1,h,\delta_{1h}} + c_{n-1}'b_{n-1,h,\delta_{1h}} \\
a_{n,\Phi_{gh}^Q}' \;(1\times GH) &= a_{n-1,\Phi_{gh}^Q}' + \mu_h'b_{n-1,h,\Phi_{gh}^Q} + c_{n-1}'b_{n-1,h,\Phi_{gh}^Q} \\
a_{n,\Sigma_{i,g}}' \;(1\times G(G+1)/2) &= a_{n-1,\Sigma_{i,g}}' + \mu_h'b_{n-1,h,\Sigma_{i,g}} + c_{n-1}'b_{n-1,h,\Sigma_{i,g}} \qquad i = 1,\ldots,H \\
a_{n,\Phi_h^Q}' \;(1\times H^2) &= a_{n-1,\Phi_h^Q}' + \mu_h'b_{n-1,h,\Phi_h^Q} + c_{n-1}'b_{n-1,h,\Phi_h^Q} + \text{vec}\left(\left(\Sigma_{gh}'b_{n-1,g} - \Sigma_h^{-1\prime}d_{n-1}^{-1}\Sigma_h'b_{n-1,gh}\right)\mu_h'\right)' \\
a_{n,\Sigma_{gh}}' \;(1\times GH) &= a_{n-1,\Sigma_{gh}}' + \mu_h'b_{n-1,h,\Sigma_{gh}} + c_{n-1}'b_{n-1,h,\Sigma_{gh}} + \text{vec}\left[b_{n-1,g}\left(\mu_h'\Phi_h^{Q\prime} - \nu_h^{Q\prime}\Sigma_h' + c_{n-1}'\right)\right]' \\
a_{n,\delta_{1g}}' \;(1\times G) &= a_{n-1,\delta_{1g}}' + \mu_h'b_{n-1,h,\delta_{1g}} + c_{n-1}'b_{n-1,gh,\delta_{1g}} + \left(\mu_g^{Q\prime} + \mu_h'\Phi_h^{Q\prime}\Sigma_{gh}' - \nu_h^{Q\prime}\Sigma_h'\Sigma_{gh}'\right)b_{n-1,g,\delta_{1g}} + b_{n-1,g}'\Sigma_{0,g}\Sigma_{0,g}'b_{n-1,g,\delta_{1g}} \\
a_{n,\Phi_g^Q}' \;(1\times G^2) &= a_{n-1,\Phi_g^Q}' + \mu_h'b_{n-1,h,\Phi_g^Q} + c_{n-1}'b_{n-1,gh,\Phi_g^Q} + \left(\mu_g^{Q\prime} + \mu_h'\Phi_h^{Q\prime}\Sigma_{gh}' - \nu_h^{Q\prime}\Sigma_h'\Sigma_{gh}'\right)b_{n-1,g,\Phi_g^Q} + b_{n-1,g}'\Sigma_{0,g}\Sigma_{0,g}'b_{n-1,g,\Phi_g^Q} \\
a_{n,\Sigma_h}' \;(1\times H(H+1)/2) &= a_{n-1,\Sigma_h}' + \mu_h'b_{n-1,h,\Sigma_h} + c_{n-1}'b_{n-1,h,\Sigma_h} + \text{vec}\left(\Sigma_h^{-1\prime}d_{n-1}^{-1}\Sigma_h'b_{n-1,gh}\mu_h'\Phi_h^{Q\prime}\Sigma_h^{-1\prime}\right)' \\
&\quad - \text{vec}\left(\Sigma_{gh}'b_{n-1,g}\nu_h^{Q\prime}\right)' + \text{vec}\left(b_{n-1,gh}c_{n-1}'\Sigma_h^{-1\prime}\right)'
\end{aligned}
\]
The derivative of \(A\) with respect to \(\delta_0\) is \(\iota_N\). The initial conditions are \(b_{1,g,\delta_{1g}} = -I_G\) and \(b_{1,h,\delta_{1h}} = -I_H\), with all other initial conditions starting at zero.
\begin{align*}
\underset{G\times G}{b_{n,g,\delta_{1g}}} &= \Phi_g^{Q\prime} b_{n-1,g,\delta_{1g}} - I_G \\
\underset{G\times G^2}{b_{n,g,\Phi^Q_g}} &= \Phi_g^{Q\prime} b_{n-1,g,\Phi^Q_g} + \left(I_G \otimes b_{n-1,g}'\right) \\
\underset{H\times H}{b_{n,h,\delta_{1h}}} &= \Phi_h^{Q\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\delta_{1h}} - I_H \\
\underset{H\times GH}{b_{n,h,\Phi^Q_{gh}}} &= \Phi_h^{Q\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Phi^Q_{gh}} + I_H \otimes b_{n-1,g}' \\
\underset{H\times G(G+1)/2}{b_{n,h,\Sigma_{i,g}}} &= \Phi_h^{Q\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Sigma_{i,g}} + e_i\,\mathrm{vec}\left(b_{n-1,g}b_{n-1,g}'\Sigma_{i,g}\right)' D_{LG} \qquad i = 1,\dots,H \\
\underset{H\times H^2}{b_{n,h,\Phi^Q_h}} &= \Phi_h^{Q\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Phi^Q_h} - I_H \otimes b_{n-1,g}'\Sigma_{gh} + I_H \otimes b_{n-1,gh}'\Sigma_h d_{n-1}^{-1}\Sigma_h^{-1} \\
\underset{H\times GH}{b_{n,h,\Sigma_{gh}}} &= \Phi_h^{Q\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Sigma_{gh}} - \Phi_h^{Q\prime}\Sigma_h^{-1\prime}\left(I_H - d_{n-1}^{-2}\right)\Sigma_h' \otimes b_{n-1,g}' \\
\underset{H\times G}{b_{n,h,\delta_{1g}}} &= \Phi_h^{Q\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,gh,\delta_{1g}} + \left(\Phi^Q_{gh} - \Sigma_{gh}\Phi^Q_h\right)' b_{n-1,g,\delta_{1g}} + \left(I_H \otimes b_{n-1,g}'\right)\Sigma_g\Sigma_g'\left(\iota_H \otimes b_{n-1,g,\delta_{1g}}\right) \\
\underset{H\times G^2}{b_{n,h,\Phi^Q_g}} &= \Phi_h^{Q\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,gh,\Phi^Q_g} + \left(\Phi^Q_{gh} - \Sigma_{gh}\Phi^Q_h\right)' b_{n-1,g,\Phi^Q_g} + \left(I_H \otimes b_{n-1,g}'\right)\Sigma_g\Sigma_g'\left(\iota_H \otimes b_{n-1,g,\Phi^Q_g}\right) \\
\underset{H\times H(H+1)/2}{b_{n,h,\Sigma_h}} &= \Phi_h^{Q\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2}\Sigma_h' b_{n-1,h,\Sigma_h} - \left(\Phi_h^{Q\prime}\Sigma_h^{-1\prime} \otimes b_{n-1,gh}'\Sigma_h d_{n-1}^{-1}\Sigma_h^{-1}\right) + \left(\Phi_h^{Q\prime}\Sigma_h^{-1\prime} d_{n-1}^{-2} \otimes b_{n-1,gh}'\right)
\end{align*}
where we also need to account for the derivatives of $b_{n,gh} = \Sigma_{gh}' b_{n,g} + b_{n,h}$ as
\begin{align*}
\underset{H\times G}{b_{n-1,gh,\delta_{1g}}} &= \Sigma_{gh}' b_{n-1,g,\delta_{1g}} + b_{n-1,h,\delta_{1g}} \\
\underset{H\times G^2}{b_{n-1,gh,\Phi^Q_g}} &= \Sigma_{gh}' b_{n-1,g,\Phi^Q_g} + b_{n-1,h,\Phi^Q_g}
\end{align*}
Notice that many of the derivatives of the loadings are zero for all maturities. These include $b_{n,g,\mu_h}$ ($G\times H$), $b_{n,h,\mu^Q_g}$ ($H\times G$), $b_{n,g,\delta_{1h}}$ ($G\times H$), $b_{n,g,\Phi^Q_h}$ ($G\times H^2$), $b_{n,g,\Sigma_h}$ ($G\times H^2$), $b_{n,g,\Sigma_{0,g}}$ ($G\times G(G+1)/2$), $b_{n,h,\Sigma_{0,g}}$ ($H\times G(G+1)/2$), $b_{n,g,\Phi^Q_{gh}}$ ($G\times GH$), $b_{n,g,\Sigma_{i,g}}$ ($G\times G(G+1)/2$), $b_{n,g,\Sigma_{gh}}$ ($G\times GH$), $b_{n,g,\nu^Q_h}$ ($G\times H$), and $b_{n,h,\nu^Q_h}$ ($H\times H$).
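These derivative recursions can be checked numerically against finite differences. Below is a minimal sketch for the simplest entry, the Gaussian derivative $b_{n,g,\delta_{1g}} = \Phi_g^{Q\prime} b_{n-1,g,\delta_{1g}} - I_G$ with $b_{1,g,\delta_{1g}} = -I_G$, assuming the Gaussian slope loading itself follows the recursion $b_{n,g} = \Phi_g^{Q\prime} b_{n-1,g} - \delta_{1g}$; the parameter values are illustrative, not from the paper.

```python
import numpy as np

def loading(phiQ, delta1, n):
    """Gaussian slope loading: b_1 = -delta1, b_k = PhiQ' b_{k-1} - delta1."""
    b = -delta1.copy()
    for _ in range(n - 1):
        b = phiQ.T @ b - delta1
    return b

def loading_deriv(phiQ, n):
    """Analytic derivative of b_n w.r.t. delta1: d_1 = -I, d_k = PhiQ' d_{k-1} - I."""
    G = phiQ.shape[0]
    d = -np.eye(G)
    for _ in range(n - 1):
        d = phiQ.T @ d - np.eye(G)
    return d

G, n = 2, 40
phiQ = np.array([[0.95, 0.02], [0.0, 0.90]])   # illustrative Q-measure dynamics
delta1 = np.array([0.01, 0.005])               # illustrative short-rate loadings

analytic = loading_deriv(phiQ, n)
# central finite differences, column by column
eps, numeric = 1e-6, np.zeros((G, G))
for j in range(G):
    e = np.zeros(G); e[j] = eps
    numeric[:, j] = (loading(phiQ, delta1 + e, n) - loading(phiQ, delta1 - e, n)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))  # tiny: the recursion matches finite differences
```

Because the loading is linear in $\delta_{1g}$, the finite-difference check is exact up to floating-point error; the same pattern extends to the non-Gaussian recursions.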
Appendix G Model appendix
Appendix G.1 General affine model
The general affine model can be summarized as
Under the $P$ measure,
\begin{align*}
h_{t+1}|x_t &\sim \text{Mult. N.C.-Gamma}\left(\nu_h, \Phi_h h_t, \Sigma_h, \mu_h\right), \\
g_{t+1} &= \mu_g + \Phi_g g_t + \Phi_{gh} h_t + \Sigma_{gh}\varepsilon_{h,t+1} + \varepsilon_{g,t+1}, \qquad \varepsilon_{g,t+1} \sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right),
\end{align*}
while under the $Q$ measure,
\begin{align*}
h_{t+1}|x_t &\sim \text{Mult. N.C.-Gamma}\left(\nu_h^Q, \Phi_h^Q h_t, \Sigma_h, \mu_h\right), \\
g_{t+1} &= \mu_g^Q + \Phi_g^Q g_t + \Phi_{gh}^Q h_t + \Sigma_{gh}\varepsilon_{h,t+1}^Q + \varepsilon_{g,t+1}^Q, \qquad \varepsilon_{g,t+1}^Q \sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right),
\end{align*}
and the short rate is
\[
r_t = \delta_0 + \delta_{1h}' h_t + \delta_{1g}' g_t,
\]
where
\begin{align*}
\Sigma_{g,t}\Sigma_{g,t}' &= \Sigma_{0,g}\Sigma_{0,g}' + \sum_{i=1}^{H} \Sigma_{i,g}\Sigma_{i,g}' h_{it}, \\
\varepsilon_{h,t+1} &= h_{t+1} - E\left(h_{t+1}|x_t\right), \\
\varepsilon_{h,t+1}^Q &= h_{t+1} - E^Q\left(h_{t+1}|x_t\right).
\end{align*}
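The non-Gaussian factor dynamics are easy to simulate via the Poisson mixture representation. Below is a minimal sketch for a univariate standard case ($H = 1$, $\mu_h = 0$), using the standard autoregressive-gamma construction $z_t \sim \text{Poisson}(\Phi_h h_t/\Sigma_h)$, $h_{t+1} \sim \Sigma_h \cdot \text{Gamma}(\nu_h + z_t)$, whose conditional mean and variance are $\nu_h\Sigma_h + \Phi_h h_t$ and $\nu_h\Sigma_h^2 + 2\Sigma_h\Phi_h h_t$; the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# ARG(nu, phi, c): z_t ~ Poisson(phi*h_t/c), h_{t+1} ~ c * Gamma(nu + z_t)
nu, phi, c = 2.0, 0.8, 0.1
T = 100_000
h = np.empty(T)
h[0] = nu * c / (1.0 - phi)          # start at the stationary mean
for t in range(T - 1):
    z = rng.poisson(phi * h[t] / c)
    h[t + 1] = c * rng.gamma(nu + z)

# conditional moments implied by the mixture:
#   E[h_{t+1}|h_t] = nu*c + phi*h_t,  V[h_{t+1}|h_t] = nu*c**2 + 2*c*phi*h_t
print(h.mean())                       # close to the stationary mean nu*c/(1-phi)
```

The process stays strictly positive by construction, which is the property the admissibility restrictions preserve in the multivariate case.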
Appendix G.2 Model with observable macroeconomic factors
Let $Y_t^{(m)}$ denote the $M \times 1$ vector of observable macroeconomic variables. For simplicity, we assume that the macroeconomic variables are Gaussian. The state vector $g_t = \left(g_t^{(y)\prime}, g_t^{(m)\prime}\right)'$ includes both latent yield factors $g_t^{(y)}$ and macroeconomic factors $g_t^{(m)}$. The model can be summarized as
Under the $P$ measure,
\begin{align*}
h_{t+1}|x_t &\sim \text{Mult. N.C.-Gamma}\left(\nu_h, \Phi_h h_t, \Sigma_h, \mu_h\right), \\
g_{t+1} &= \mu_g + \Phi_g g_t + \Phi_{gh} h_t + \Sigma_{gh}\varepsilon_{h,t+1} + \varepsilon_{g,t+1}, \qquad \varepsilon_{g,t+1} \sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right),
\end{align*}
while under the $Q$ measure,
\begin{align*}
h_{t+1}|x_t &\sim \text{Mult. N.C.-Gamma}\left(\nu_h^Q, \Phi_h^Q h_t, \Sigma_h, \mu_h\right), \\
g_{t+1} &= \mu_g^Q + \Phi_g^Q g_t + \Phi_{gh}^Q h_t + \Sigma_{gh}\varepsilon_{h,t+1}^Q + \varepsilon_{g,t+1}^Q, \qquad \varepsilon_{g,t+1}^Q \sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right),
\end{align*}
and the short rate is
\[
r_t = \delta_0 + \delta_{1h}' h_t + \delta_{1g}' g_t,
\]
where
\begin{align*}
\Sigma_{g,t}\Sigma_{g,t}' &= \Sigma_{0,g}\Sigma_{0,g}' + \sum_{i=1}^{H} \Sigma_{i,g}\Sigma_{i,g}' h_{it}, \\
\varepsilon_{h,t+1} &= h_{t+1} - E\left(h_{t+1}|x_t\right), \\
\varepsilon_{h,t+1}^Q &= h_{t+1} - E^Q\left(h_{t+1}|x_t\right).
\end{align*}
The measurement equation relating the factors to the observed yields and macroeconomic variables is
\[
\begin{pmatrix} Y_t^{(1)} \\ Y_t^{(m)} \end{pmatrix}
= \begin{pmatrix} A_1 \\ 0 \end{pmatrix}
+ \begin{pmatrix} B_{1h} & B_{1g^{(y)}} & B_{1g^{(m)}} \\ 0_{M\times H} & 0_{M\times G} & I_M \end{pmatrix}
\begin{pmatrix} h_t \\ g_t^{(y)} \\ g_t^{(m)} \end{pmatrix}.
\]
The bond loadings for the macroeconomic factors B1g(m) can be calculated jointly with the bond loadings
for the latent Gaussian factors.
Appendix G.3 Gaussian model
The Gaussian model can be summarized as
\begin{align*}
g_{t+1} &= \mu_g + \Phi_g g_t + \varepsilon_{g,t+1}, \qquad \varepsilon_{g,t+1} \sim N\left(0, \Sigma_{0,g}\Sigma_{0,g}'\right), \\
g_{t+1} &= \mu_g^Q + \Phi_g^Q g_t + \varepsilon_{g,t+1}^Q, \qquad \varepsilon_{g,t+1}^Q \sim N\left(0, \Sigma_{0,g}\Sigma_{0,g}'\right), \\
r_t &= \delta_0 + \delta_{1g}' g_t,
\end{align*}
where the first line gives the $P$ dynamics and the second the $Q$ dynamics. The bond loadings recursions (5) and (7) still apply, but with all non-Gaussian parameters set to zero.
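The Gaussian special case makes the loading recursions concrete. A minimal sketch, assuming recursions (5) and (7) reduce here to the familiar Gaussian-ATSM form $a_n = a_{n-1} + b_{n-1}'\mu_g^Q + \frac{1}{2} b_{n-1}'\Sigma_{0,g}\Sigma_{0,g}' b_{n-1} - \delta_0$ and $b_n = \Phi_g^{Q\prime} b_{n-1} - \delta_{1g}$, with $a_1 = -\delta_0$ and $b_1 = -\delta_{1g}$; the parameter values are illustrative.

```python
import numpy as np

def gaussian_loadings(muQ, phiQ, Sigma0, delta0, delta1, N):
    """Loading recursions for log bond prices p_n = a_n + b_n' g_t in the Gaussian model."""
    a, b = -delta0, -delta1.copy()
    A, B = [a], [b]
    for _ in range(N - 1):
        a = a + b @ muQ + 0.5 * b @ (Sigma0 @ Sigma0.T) @ b - delta0
        b = phiQ.T @ b - delta1
        A.append(a); B.append(b)
    return np.array(A), np.array(B)

muQ = np.array([0.0, 0.0])                      # illustrative parameter values
phiQ = np.array([[0.97, 0.0], [0.01, 0.95]])
Sigma0 = 0.001 * np.eye(2)
delta0, delta1 = 0.005, np.array([0.01, 0.008])

A, B = gaussian_loadings(muQ, phiQ, Sigma0, delta0, delta1, 120)
# closed form for the slope: b_n = -(sum_{k=0}^{n-1} (PhiQ')^k) delta1
```

Because the slope recursion is linear, $b_n$ has the closed form in the last comment, which provides a simple correctness check on the loop.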
Appendix G.4 “Hidden” factors
The general affine model with hidden Gaussian factors has three sets of state variables $x_t = \left(h_t', g_{1t}', g_{2t}'\right)'$, where the dimensions of the Gaussian state variables are $G_1 \times 1$ and $G_2 \times 1$, respectively. The model is parameterized intentionally such that only the state variables $h_t$ and $g_{1t}$ are priced by the cross-section. We also define the model such that the non-Gaussian factors only impact the variances of the shocks to $g_{1t}$. The dynamics under the $P$ measure are
\begin{align*}
h_{t+1}|x_t &\sim \text{Mult. N.C.-Gamma}\left(\nu_h, \Phi_h h_t, \Sigma_h, \mu_h\right), \\
\begin{pmatrix} g_{1,t+1} \\ g_{2,t+1} \end{pmatrix}
&= \begin{pmatrix} \mu_{g,1} \\ \mu_{g,2} \end{pmatrix}
+ \begin{pmatrix} \Phi_{gh,1} & \Phi_{g,11} & \Phi_{g,12} \\ \Phi_{gh,2} & \Phi_{g,21} & \Phi_{g,22} \end{pmatrix}
\begin{pmatrix} h_t \\ g_{1t} \\ g_{2t} \end{pmatrix}
+ \begin{pmatrix} \Sigma_{gh,1} \\ \Sigma_{gh,2} \end{pmatrix} \varepsilon_{h,t+1}
+ \begin{pmatrix} \varepsilon_{g,1,t+1} \\ \varepsilon_{g,2,t+1} \end{pmatrix}, \\
\varepsilon_{g,1,t+1} &\sim N\left(0, \Sigma_{g,t}\Sigma_{g,t}'\right), \\
\varepsilon_{g,2,t+1} &\sim N\left(0, \Sigma_{\varepsilon_{g2}}\Sigma_{\varepsilon_{g2}}'\right), \\
\varepsilon_{h,t+1} &= h_{t+1} - E\left(h_{t+1}|x_t\right), \\
\Sigma_{g,t}\Sigma_{g,t}' &= \Sigma_{0,g}\Sigma_{0,g}' + \sum_{i=1}^{H} \Sigma_{i,g}\Sigma_{i,g}' h_{it},
\end{align*}
and
\[
r_t = \delta_0 + \delta_{1h}' h_t + \delta_{1g_1}' g_{1t} + \delta_{1g_2}' g_{2t}.
\]
The dynamics under the $Q$ measure are the same with the following restrictions: $\delta_{1g_2} = 0$ and $\Phi_{g,12}^Q = 0$. Under these restrictions, the bond loadings on the Gaussian state vector $g_{2t}$ will always be zero by construction for all maturities. The bond loadings recursions (5), (6), and (7) remain the same as before.
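The zero-loading claim can be verified mechanically: with $\delta_{1g_2} = 0$ and $\Phi_{g,12}^Q = 0$, the block of $\Phi^{Q\prime}$ that would feed the $g_1$ loadings into the $g_2$ loadings vanishes, so the $g_2$ block of $b_n$ stays at zero at every maturity. A sketch for a Gaussian two-block example, using the standard slope recursion $b_n = \Phi^{Q\prime} b_{n-1} - \delta_1$ with illustrative numbers:

```python
import numpy as np

G1, G2 = 2, 2
# Q-measure restrictions for hidden factors: Phi_{g,12}^Q = 0 and delta_{1g2} = 0
phiQ = np.block([[0.95 * np.eye(G1), np.zeros((G1, G2))],
                 [0.10 * np.ones((G2, G1)), 0.90 * np.eye(G2)]])
delta1 = np.concatenate([np.array([0.01, 0.008]), np.zeros(G2)])

b = -delta1
for n in range(360):
    b = phiQ.T @ b - delta1

print(b[G1:])  # the g2 block is identically zero at every maturity
```

The $g_2$ block update is $b_{2,n} = \Phi_{g,22}^{Q\prime} b_{2,n-1}$, which maps zero to zero, confirming the construction.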
Appendix H Relationship to continuous-time
Appendix H.1 Mapping between NCG and CIR
To see the intuition behind why the standard NCG process ($\mu_h = 0$, diagonal $\Sigma_h$) converges to the Cox, Ingersoll, and Ross (1985) process in the continuous-time limit, note that the autoregressive gamma process can be written as a non-Gaussian autoregression with conditional heteroskedasticity
\[
w_{i,t+1} = \nu_{h,i}\Sigma_{h,ii} + \Phi_{h,i}' w_t + \sqrt{\nu_{h,i}\Sigma_{h,ii}^2 + 2\Sigma_{h,ii}\Phi_{h,i}' w_t}\;\varepsilon_{h,i,t+1},
\]
where the shock $\varepsilon_{h,i,t+1} = \left(w_{i,t+1} - E_t[w_{i,t+1}]\right)/\sqrt{V_t[w_{i,t+1}]}$ is a non-central gamma random variable that has been standardized to have mean zero and variance one. The shock $\varepsilon_{h,i,t+1}$ is a Poisson sum of gamma random variables. Due to the infinite divisibility of the gamma distribution and the central limit theorem, the standardized variable $\varepsilon_{h,i,t+1}$ will tend to a Gaussian random variable as the time between observations $\tau \to 0$.
The continuous-time representation is
\[
dw_{it} = \left(\alpha_i - \kappa_i' w_t\right)dt + \sigma_i\sqrt{w_{it}}\,dW_{it}, \tag{H.1}
\]
with the following mapping:
\[
\Phi_h = I_H - \kappa\tau, \qquad \nu_{h,i} = \frac{2\alpha_i}{\sigma_i^2}, \qquad \Sigma_{h,ii} = \frac{\sigma_i^2\tau}{2}.
\]
The continuous-time process implies first and second conditional moments
\begin{align*}
E\left[dw_{it}|w_t\right] = \left(\alpha_i - \kappa_i' w_t\right)dt \;&\Longrightarrow\; E\left[w_{i,t+\tau}|w_t\right] = \alpha_i\tau + \left(\iota_H' - \kappa_i'\tau\right)w_t, \\
V\left[dw_{it}|w_t\right] = \sigma_i^2 w_{it}\,dt \;&\Longrightarrow\; V\left[w_{i,t+\tau}|w_t\right] = \sigma_i^2 w_{it}\tau,
\end{align*}
while the discrete-time process implies
\begin{align*}
E\left[w_{i,t+\tau}|w_t\right] &= \nu_{h,i}\Sigma_{h,ii} + \Phi_{h,i}' w_t = \alpha_i\tau + \left(\iota_H' - \kappa_i'\tau\right)w_t, \\
V\left[w_{i,t+\tau}|w_t\right] &= \nu_{h,i}\Sigma_{h,ii}^2 + 2\Sigma_{h,ii}\Phi_{h,i}' w_t = \frac{\alpha_i\sigma_i^2\tau^2}{2} + \sigma_i^2\tau\left(\iota_H' - \kappa_i'\tau\right)w_t.
\end{align*}
When $\tau \to 0$, the discrete-time conditional moments approach their continuous-time counterparts.
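A quick numerical check of this convergence in the scalar case: the gap between the discrete- and continuous-time conditional variances is $\alpha\sigma^2\tau^2/2 - \sigma^2\kappa\tau^2 w$, i.e. $O(\tau^2)$, so halving $\tau$ should shrink it by a factor of four. The parameter values below are illustrative.

```python
alpha, kappa, sigma, w = 0.5, 2.0, 0.3, 1.2   # illustrative CIR parameters and state

def gap(tau):
    """|discrete - continuous| conditional variance over one step of length tau."""
    nu, Sig, phi = 2 * alpha / sigma**2, sigma**2 * tau / 2, 1 - kappa * tau
    v_disc = nu * Sig**2 + 2 * Sig * phi * w   # discrete-time conditional variance
    v_cont = sigma**2 * w * tau                # continuous-time conditional variance
    return abs(v_disc - v_cont)

print(gap(0.01) / gap(0.005))  # roughly 4: the gap is O(tau^2)
```

The same calculation for the conditional means gives a gap of exactly zero, since the mapping matches the first moment exactly at every $\tau$.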
Appendix H.2 Bond recursions in continuous time
Assume the factor dynamics under the $Q$ measure are
\[
dx_t = \left(\alpha - \kappa x_t\right)dt + \Sigma_t\,dW_t,
\]
where the covariance matrix is time varying and depends only on $h_t$:
\[
\Sigma_t\Sigma_t' = \Sigma_0\Sigma_0' + \sum_{i=1}^{H}\Sigma_i\Sigma_i' h_{it}.
\]
We impose admissibility restrictions to guarantee positivity of the non-Gaussian factor: $\alpha \geq 0$, $\kappa_{1,hg} = 0$. The continuous-time versions of the bond recursions in equations (5)--(7) are as follows:
\begin{align}
\dot{a} &= \alpha' b + \frac{1}{2} b'\Sigma_0\Sigma_0' b - \delta_0, \tag{H.2} \\
\dot{b}_g &= -\kappa_{gg}' b_g - \delta_{1,g}, \tag{H.3} \\
\dot{b}_h &= -\kappa_{hh}' b_h - \kappa_{gh}' b_g + \frac{1}{2}\left(I_H \otimes b'\right)\Sigma\Sigma'\left(\iota_H \otimes b\right) - \delta_{1,h}, \tag{H.4}
\end{align}
where $\Sigma\Sigma'$ is a block diagonal matrix with diagonal blocks $\Sigma_i\Sigma_i'$ for $i = 1,\dots,H$.
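For the Gaussian block, equation (H.3) is linear and, in the scalar case, has the closed form $b_g(\tau) = -\delta_{1,g}\left(1 - e^{-\kappa\tau}\right)/\kappa$, which makes a handy check on any numerical integration of these ODEs. A minimal Euler sketch with illustrative parameters:

```python
import math

kappa, delta1 = 1.5, 0.02     # illustrative scalar parameters
T, steps = 5.0, 50_000
dt = T / steps

b = 0.0                       # b_g(0) = 0: a zero-maturity bond has no factor exposure
for _ in range(steps):
    b += (-kappa * b - delta1) * dt   # Euler step of (H.3): db/dtau = -kappa*b - delta1

closed = -delta1 * (1 - math.exp(-kappa * T)) / kappa
print(abs(b - closed))        # small Euler discretization error
```

The non-Gaussian equation (H.4) is a Riccati-type ODE because of the quadratic term in $b$, so it has no such closed form in general, but the same step-by-step integration applies.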