
Machine learning Methods in asset pricing

Doron Avramov, IDC, and Lior Metzker, Hebrew U

Updated, December 5, 2020


Machine learning Methods in asset pricing

Machine learning prescribes a vast collection of high-dimensional models that attempt to predict future quantities of interest while imposing regularization.

The sections below describe the architecture of various machine learning routines and their implementations in empirical asset pricing.

In finance, some routines draw on beta-pricing representations, some on pricing kernel formulations, and some are reduced forms.

We start, for perspective, with ordinary least squares (OLS).

OLS is the best linear unbiased estimator (BLUE) of the regression coefficients.

"Best" means the lowest-variance estimator among all unbiased linear estimators.

Regression errors do not have to be normal, nor do they have to be independently and identically distributed.

But errors have to be zero mean, serially uncorrelated, and uncorrelated with the predictors, as well as homoskedastic with finite variance.

In the presence of heteroskedasticity or autocorrelation, OLS is no longer BLUE.


Shortcomings of OLS

With heteroskedasticity, we can still use OLS estimators by finding heteroskedasticity-robust estimators of the variance, or we can devise an efficient estimator by re-weighting the data appropriately to incorporate heteroskedasticity.

Similarly, with autocorrelation, we can find an autocorrelation-robust estimator of the variance. Alternatively, we can devise an efficient estimator by re-weighting the data appropriately to account for autocorrelation.

Requiring the estimator to be linear is binding, since non-linear estimators do exist. This is where routines like the nonparametric Lasso and neural networks (NN) come into play.

Likewise, the unbiasedness requirement is crucial, since biased estimators do exist. This is where shrinkage methods come into play: the variance of the OLS estimator can be too high because OLS coefficients are unregularized.

If judged by mean squared error (MSE), alternative biased estimators could be more attractive if they produce substantially smaller variance.


Shortcomings of OLS

MSE of an OLS estimate is computed as follows.

Let $\beta$ denote the true regression coefficients, and let $\hat\beta = (X'X)^{-1}X'Y$, where $X$ is a $T \times M$ matrix of explanatory variables (first column is ones) and $Y$ is a $T \times 1$ vector of the dependent variable.

Then, the mean squared error of the OLS estimate is given by

$$
\begin{aligned}
\mathrm{MSE}(\hat\beta) &= E\big[(\hat\beta - \beta)'(\hat\beta - \beta)\big] \\
&= E\big[\operatorname{tr}\big((\hat\beta - \beta)'(\hat\beta - \beta)\big)\big] \\
&= E\big[\operatorname{tr}\big((\hat\beta - \beta)(\hat\beta - \beta)'\big)\big] \\
&= \operatorname{tr}\big(E\big[(\hat\beta - \beta)(\hat\beta - \beta)'\big]\big) \\
&= \sigma^2\operatorname{tr}\big[(X'X)^{-1}\big].
\end{aligned}
$$

When predictors are highly correlated, the expression $\operatorname{tr}[(X'X)^{-1}]$ quickly explodes.
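As a quick numerical check, here is a minimal numpy sketch (simulated data; the correlation level rho and the dimensions are arbitrary choices) showing that $\operatorname{tr}[(X'X)^{-1}]$ grows as predictors become more correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 200, 10

for rho in [0.0, 0.9, 0.99]:
    # Equicorrelated predictors: unit variances, pairwise correlation rho
    cov = np.full((M, M), rho) + (1 - rho) * np.eye(M)
    X = rng.multivariate_normal(np.zeros(M), cov, size=T)
    # tr[(X'X)^{-1}] scales the MSE of the OLS estimate (up to sigma^2)
    print(f"rho = {rho:4.2f} -> tr[(X'X)^-1] = {np.trace(np.linalg.inv(X.T @ X)):.4f}")
```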


Shortcomings of OLS

In the presence of many predictors, OLS delivers nonzero estimates for all coefficients; thus, it is difficult to implement variable selection when the true data-generating process is sparse.

Moreover, the OLS solution is not unique if the design matrix X is not of full rank.

In addition, OLS does not account for potential non-linearities and interactions between predictors.

In sum, OLS is restrictive, often provides poor predictions, may be subject to over-fitting, does not penalize model complexity, and could be difficult to interpret.

Bayesian perspective: one can introduce informed priors on regression coefficients to shrink slopes towards zero (or towards values implied by economic theory).

Classical (non-Bayesian) perspective: shrinkage methods penalize complexity and impose regularization.

Non-linearities and interactions between predictors can also be accounted for.

Such objectives are accomplished through an assortment of machine learning methods.


Stock return Predictability: Economic restrictions on OLS

Prior to formulating ML methods, I describe two ways to possibly improve OLS estimates.

Source: Gu, Kelly, and Xiu (2019).

Base case: the pooled OLS estimator corresponds to a panel (balanced WLOG) regression of future returns on firm attributes, where T and N represent the time-series and cross-section dimensions.

The objective is formulated as

$$\mathcal{L}(\theta) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\big(r_{i,t+1} - f(x_{i,t};\theta)\big)^2$$

where $r_{i,t+1}$ is the return on stock $i$ at time $t+1$, $f = x_{i,t}'\theta$ is the corresponding predicted return, $x_{i,t}$ is a set of firm characteristics, and $\theta$ stands for model parameters.

Predictive performance could be improved using optimization that value weights stocks based on market size, volatility, credit quality, etc.:

$$\mathcal{L}(\theta) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} w_{i,t}\big(r_{i,t+1} - f(x_{i,t};\theta)\big)^2$$


Stock return Predictability: Economic restrictions on OLS

An alternative optimization takes account of the heavy tails displayed by stock returns and the potentially harmful effects of outliers.

Then the objective is formulated such that squared (absolute) loss is applied to small (large) errors:

$$\mathcal{L}(\theta) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} H\big(r_{i,t+1} - f(x_{i,t};\theta),\,\xi\big)$$

where $\xi$ is a tuning hyper-parameter and

$$H(y,\xi) = \begin{cases} y^2, & \text{if } |y| \le \xi \\ 2\xi|y| - \xi^2, & \text{if } |y| > \xi \end{cases}$$

The hyper-parameter 𝜉 is determined by the model performance in a validation sample.

Later, the selection of hyperparameters is described in detail.
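To make the loss concrete, here is a minimal numpy sketch of the function $H(y,\xi)$ defined above, evaluated on hypothetical forecast errors (the value of $\xi$ is an arbitrary choice):

```python
import numpy as np

def H(y, xi):
    """Squared loss for small errors; linear (absolute) loss for large errors."""
    a = np.abs(y)
    return np.where(a <= xi, a**2, 2 * xi * a - xi**2)

errors = np.array([-0.50, 0.02, 0.10, 1.50])   # hypothetical forecast errors
print(H(errors, xi=0.25))                       # large errors are penalized linearly
```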


Ridge Regression

There are various shrinkage methods.

We start with Ridge.

Hoerl and Kennard (1970a, 1970b) introduce the Ridge regression

$$\min_\beta\ (Y - X\beta)'(Y - X\beta) \quad \text{s.t.} \quad \sum_{j=1}^{M}\beta_j^2 \le c$$

The minimization can be rewritten as

$$\mathcal{L}(\beta) = (Y - X\beta)'(Y - X\beta) + \lambda\,\beta'\beta$$

We get $\hat\beta_{\text{ridge}} = (X'X + \lambda I_M)^{-1}X'Y$

where 𝐼𝑀 is the identity matrix of order 𝑀.

Notice that including 𝜆 makes the problem non-singular even when 𝑋′𝑋 is non-invertible.

𝜆 is a hyper-parameter that controls the amount of regularization.


Ridge Regression

As 𝜆 → 0, the OLS estimator obtains.

As 𝜆 → ∞, we have $\hat\beta_{\text{ridge}} = 0$, or an intercept-only model.

Ridge regressions do not have a sparse representation (LASSO does), so using model selection criteria to pick 𝜆 is infeasible.

Instead, validation methods should be employed.

The notion of validation is to split the sample into three pieces: training, validation, and test.

In the training sample, the model is estimated for various values of 𝜆, each of which delivers a prediction. The validation sample is used to choose the 𝜆 that provides the smallest forecast errors. Hence, both training and validation samples are used to pick 𝜆. Then the experiment is assessed through out-of-sample predictions in the test sample, given the choice of 𝜆.

As shown below, from a Bayesian perspective, the parameter 𝜆 denotes the prior precision of beliefs that regression slope coefficients are all equal to zero.

Classical perspective: the ridge estimator is biased:

$$E[\hat\beta_{\text{ridge}}] = \big[I_M + \lambda (X'X)^{-1}\big]^{-1}\beta \ne \beta$$

Bayesian perspective: no bias. That is, $E[\hat\beta_{\text{ridge}}]$ is simply the posterior mean of $\beta$.


Interpretations of the Ridge Regression
Interpretation #1: Data Augmentation

The ridge-minimization problem can be formulated as

$$\sum_{t=1}^{T}(y_t - x_t'\beta)^2 + \sum_{j=1}^{M}\big(0 - \sqrt{\lambda}\,\beta_j\big)^2$$

Thus, the ridge estimator is the usual OLS estimator where the data are augmented such that

$$X_\lambda = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_M \end{pmatrix}, \qquad Y_\lambda = \begin{pmatrix} Y \\ 0_M \end{pmatrix}$$

where $0_M$ is an M-vector of zeros. Then, it follows that

$$\hat\beta_{\text{ridge}} = (X_\lambda' X_\lambda)^{-1}X_\lambda' Y_\lambda = (X'X + \lambda I_M)^{-1}X'Y$$
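A minimal numpy sketch (simulated data; the dimensions and 𝜆 are arbitrary choices) verifying that OLS on the augmented data $(X_\lambda, Y_\lambda)$ reproduces the closed-form ridge estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
T, M, lam = 100, 5, 2.0
X = rng.standard_normal((T, M))
Y = X @ rng.standard_normal(M) + rng.standard_normal(T)

# Closed form: (X'X + lam*I)^{-1} X'Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ Y)

# Data augmentation: stack sqrt(lam)*I under X and zeros under Y, then run OLS
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(M)])
Y_aug = np.concatenate([Y, np.zeros(M)])
beta_aug, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

assert np.allclose(beta_ridge, beta_aug)
```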


Interpretations of the Ridge Regression
Interpretation #2: Informative Bayes Prior

Suppose the prior on 𝛽 is of the form:

$$\beta \sim N\!\Big(0,\ \frac{1}{\lambda} I_M\Big)$$

Then, the posterior mean of $\beta$ is

$$(X'X + \lambda I_M)^{-1}X'Y$$

In the context of asset pricing, Bayesian methods are quite useful.

For instance, financial economists have considered the family of priors for asset mispricing

$$\alpha \sim N\!\Big(0,\ \frac{\sigma^2}{s^2} V^\eta\Big)$$

where $s^2 = \operatorname{trace}(V) = \sum_{j=1}^{N}\lambda_j$ and the $\lambda_j$'s are the eigenvalues of $V$.

To prove the second equality, notice that $\operatorname{trace}(V) = \operatorname{trace}(Q\Lambda Q') = \operatorname{trace}(\Lambda Q'Q) = \operatorname{trace}(\Lambda) = \sum_{j=1}^{N}\lambda_j$, where $Q$ is the matrix of ordered eigenvectors and $\Lambda$ is a diagonal matrix with the corresponding ordered eigenvalues.

𝜎2 controls the degree of confidence in the prior.

Limit cases: zero 𝜎2 means dogmatic beliefs while infinitely large 𝜎2 amounts to non-informative priors.

The case 𝜂 = 1 gives the asset pricing prior of Pastor (2000) and Pastor and Stambaugh (2000).

The Pastor-Stambaugh prior seems flexible, since factors are pre-specified and are not ordered per their importance; yet there are also merits to the 𝜂 = 2 case, which applies to factors that are principal components, as discussed below.


Interpretations of the Ridge Regression

To continue the Bayesian interpretation, let us consider the Hansen-Jagannathan representation of the pricing kernel

$$
\begin{aligned}
M_t &= 1 - b'(r_t - \mu) \\
&= 1 - \mu' V^{-1}(r_t - \mu) = 1 - \mu' Q\Lambda^{-1}Q'(r_t - \mu) \\
&= 1 - \mu_Q' \Lambda^{-1}(q_t - \mu_Q) \\
&= 1 - b_Q'(q_t - \mu_Q)
\end{aligned}
$$

where $q_t = Q'r_t$ denotes the principal component portfolios and the second equality follows by the Euler equation.

Assuming $\mu \sim N\!\big(0,\ \frac{\sigma^2}{s^2}V^\eta\big)$, it follows that $\mu_Q = Q'\mu$ has the prior distribution

$$\mu_Q = Q'\mu \sim N\!\Big(0,\ \frac{\sigma^2}{s^2}Q'V^\eta Q\Big) \sim N\!\Big(0,\ \frac{\sigma^2}{s^2}Q'Q\Lambda^\eta Q'Q\Big) \sim N\!\Big(0,\ \frac{\sigma^2}{s^2}\Lambda^\eta\Big)$$


Interpretations of the Ridge Regression

As $b_Q = \Lambda^{-1}\mu_Q$, its prior distribution is formulated as ($\Lambda$ is assumed known):

$$b_Q = \Lambda^{-1}\mu_Q \sim N\!\Big(0,\ \frac{\sigma^2}{s^2}\Lambda^{\eta-2}\Big)$$

For 𝜂 < 2, the variance of the 𝑏𝑄 coefficients associated with the smallest eigenvalues explodes.

For 𝜂 = 2, the pricing kernel coefficients $b = V^{-1}\mu$ have the prior distribution

$$b \sim N\!\Big(0,\ \frac{\sigma^2}{s^2} I_N\Big)$$

Picking 𝜂 = 2 makes the prior of 𝑏 independent of 𝑉.

Let us stick to this prior and further formulate the likelihood for $b$ as

$$b \sim N\!\Big(V^{-1}\hat\mu,\ \frac{1}{T}V^{-1}\Big)$$

where $\hat\mu$ is the sample mean return.

Then, the posterior mean of $b$ is given by $E(b) = (V + \lambda I_N)^{-1}\hat\mu$, where $\lambda = \frac{s^2}{T\sigma^2}$.

A ridge regression of the pricing kernel projected on the $N$ demeaned returns, with tuning parameter 𝜆, would deliver the same $E(b)$ coefficients.

That is, the (biased) ridge coefficient is equal to the posterior mean.


Interpretations of the Ridge Regression

To dig a bit deeper, note that the prior expected value of the squared Sharpe ratio (SR) is given by (V is assumed known):

$$
\begin{aligned}
E[SR^2] &= E[\mu'V^{-1}\mu] = E[\mu'Q\Lambda^{-1}Q'\mu] \\
&= E[\mu_Q'\Lambda^{-1}\mu_Q] \\
&= E\big[\operatorname{trace}\big(\mu_Q'\Lambda^{-1}\mu_Q\big)\big] \\
&= \operatorname{trace}\big(\Lambda^{-1}E[\mu_Q\mu_Q']\big) \\
&= \frac{\sigma^2}{s^2}\operatorname{trace}\big(\Lambda^{\eta-1}\big)
\end{aligned}
$$

The Pastor-Stambaugh prior (𝜂 = 1) tells you that

$$E[SR^2] = \frac{\sigma^2}{s^2}\operatorname{trace}(I_N) = N\,\frac{\sigma^2}{s^2}$$

That is, each principal component portfolio has the same expected contribution to the Sharpe ratio.

If 𝜂 = 2, then

$$E[SR^2] = \sum_{j=1}^{N}\frac{\sigma^2}{s^2}\lambda_j = \sigma^2$$

Thus, the expected contribution of each PC is proportional to its eigenvalue.


Interpretations of the Ridge Regression
Interpretation #3: Eigenvalues and Eigenvectors

By the singular value decomposition, we can express 𝑋 as

$$X_{T\times M} = U_{T\times M}\,\Lambda_{M\times M}^{0.5}\,V_{M\times M}'$$

where $U = [U_1, \dots, U_M]$ is a $T \times M$ orthonormal matrix, $\Lambda^{0.5} = \operatorname{diag}(\lambda_1^{0.5}, \dots, \lambda_M^{0.5})$ is an $M \times M$ diagonal matrix with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_M$, and $V = [V_1, V_2, \dots, V_M]$ is an $M \times M$ orthonormal matrix.

As 𝑋′𝑋 = 𝑉Λ𝑉′, the columns of V are the eigenvectors of 𝑋′𝑋 and (𝜆1, … , 𝜆𝑀) are the corresponding eigenvalues.

As 𝑋𝑋′ = 𝑈Λ𝑈′, the columns of U are the eigenvectors of 𝑋𝑋′.

$U$ and $V$ are equal if $X$ is symmetric and positive semidefinite; in that case $X'X = XX'$, and the SVD coincides with the eigendecomposition $X = Q\Lambda Q'$, with the usual interpretation that $Q$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues.

$X'X$ and $XX'$ are both positive semidefinite; therefore their eigenvalues are non-negative, which explains why square roots of eigenvalues appear in the decomposition of $X$.


Interpretations of the Ridge Regression

Ridge: how is the predicted value of Y related to the eigenvectors?

To answer, we would first like to find the eigenvectors and eigenvalues of the matrix

$$Z = X'X + \lambda I_M$$

We know that for every $j = 1, 2, \dots, M$, the following holds by definition:

$$(X'X)V_j = \lambda_j V_j$$

Thus,

$$(X'X + \lambda I_M)V_j = (X'X)V_j + \lambda V_j = \lambda_j V_j + \lambda V_j = (\lambda_j + \lambda)V_j$$

This tells you that $V$ still collects the eigenvectors of $Z$, while $\lambda_j + \lambda$ is the $j$-th eigenvalue.

Notice now that if $A = V\Lambda V'$ then $A^L = V\Lambda^L V'$, where $L$ can be either positive or negative: same eigenvectors, while the eigenvalues are raised to the power of $L$.


Interpretations of the Ridge Regression

Then, the inverse of the matrix Z is given by

$$Z^{-1} = V \operatorname{diag}\!\Big(\frac{1}{\lambda_1 + \lambda},\ \frac{1}{\lambda_2 + \lambda},\ \dots,\ \frac{1}{\lambda_M + \lambda}\Big) V'$$

and

$$\hat\beta_{\text{ridge}} = Z^{-1}X'Y = (X'X + \lambda I_M)^{-1}X'Y = V \operatorname{diag}\!\Big(\frac{\lambda_1^{0.5}}{\lambda_1 + \lambda},\ \frac{\lambda_2^{0.5}}{\lambda_2 + \lambda},\ \dots,\ \frac{\lambda_M^{0.5}}{\lambda_M + \lambda}\Big) U'Y$$

And the fitted value is

$$\hat Y_{\text{ridge}} = X\hat\beta_{\text{ridge}} = \sum_{j=1}^{M}\Big(U_j\,\frac{\lambda_j}{\lambda_j + \lambda}\,U_j'\Big)Y = U \operatorname{diag}\!\Big(\frac{\lambda_1}{\lambda_1 + \lambda},\ \frac{\lambda_2}{\lambda_2 + \lambda},\ \dots,\ \frac{\lambda_M}{\lambda_M + \lambda}\Big) U'Y$$

Hence, ridge regression projects 𝑌 onto components with large 𝜆𝑗.

Or, ridge regression shrinks the coefficients of low-variance components.

Taking 𝜆 = 0, we are back to OLS. See also the next page.


OLS – Eigenvalue and Eigenvector Based Analysis

From the singular value decomposition, we again express $X$ as

$$X_{T\times M} = U_{T\times M}\,\Lambda_{M\times M}^{0.5}\,V_{M\times M}'$$

The OLS estimate is then given by

$$
\begin{aligned}
\hat\beta_{\text{OLS}} &= (X'X)^{-1}X'Y \\
&= (V\Lambda V')^{-1}V\Lambda^{0.5}U'Y \\
&= V\Lambda^{-1}V'V\Lambda^{0.5}U'Y \\
&= V\Lambda^{-0.5}U'Y \\
&= V \operatorname{diag}\big(\lambda_1^{-0.5},\ \lambda_2^{-0.5},\ \dots,\ \lambda_M^{-0.5}\big)U'Y
\end{aligned}
$$

And the fitted value is

$$
\begin{aligned}
\hat Y_{\text{OLS}} = X\hat\beta_{\text{OLS}} &= U\Lambda^{0.5}V'V \operatorname{diag}\big(\lambda_1^{-0.5},\ \lambda_2^{-0.5},\ \dots,\ \lambda_M^{-0.5}\big)U'Y \\
&= U\Lambda^{0.5}\Lambda^{-0.5}U'Y = UU'Y = \sum_{j=1}^{M}(U_j U_j')\,Y
\end{aligned}
$$

The interpretation for the last equation is that we project 𝑌 on the M columns of U.

For comparison, Ridge gives stronger prominence to columns in U associated with high eigenvalues, while PCA considers only the first K<M columns (coming up next).


PCA vs. OLS and Ridge Regressions

In a principal components analysis (PCA) setup, we project $Y$ on a subset of $U_j$, $j = 1, 2, \dots, K < M$.

Notice that the $U_j$'s are ordered by the size of their corresponding eigenvalues, in descending order.

The 𝑋′𝑋 expression is approximated by using the 𝐾 largest eigenvectors and eigenvalues.

$$X'X = V\Lambda V' \approx \tilde V\tilde\Lambda\tilde V' = \big[V_1, \dots, V_K,\ 0_{M\times(M-K)}\big]\ \operatorname{diag}\big(\lambda_1, \lambda_2, \dots, \lambda_K,\ 0_{(M-K)\times 1}\big) \begin{bmatrix} V_1' \\ \vdots \\ V_K' \\ 0_{(M-K)\times M} \end{bmatrix}$$

Then,

$$
\begin{aligned}
\hat\beta_{\text{PCA}} &= (\tilde V\tilde\Lambda\tilde V')^{-1}V\Lambda^{0.5}U'Y \\
&= \tilde V\tilde\Lambda^{-1}\tilde V'V\Lambda^{0.5}U'Y = \tilde V\tilde\Lambda^{-1}\begin{bmatrix} I_{K\times K} & 0_{K\times(M-K)} \\ 0_{(M-K)\times K} & 0_{(M-K)\times(M-K)} \end{bmatrix}\Lambda^{0.5}U'Y \\
&= \tilde V \operatorname{diag}\big(\lambda_1^{-0.5},\ \lambda_2^{-0.5},\ \dots,\ \lambda_K^{-0.5},\ 0,\ 0,\ \dots\big)U'Y
\end{aligned}
$$

And the fitted value is

$$
\begin{aligned}
\hat Y_{\text{PCA}} = X\hat\beta_{\text{PCA}} &= U\Lambda^{0.5}V'\tilde V \operatorname{diag}\big(\lambda_1^{-0.5},\ \lambda_2^{-0.5},\ \dots,\ \lambda_K^{-0.5},\ 0,\ 0,\ \dots\big)U'Y \\
&= U \operatorname{diag}(1, 1, \dots, 1, 0, 0, \dots)\,U'Y \\
&= \sum_{j=1}^{K}(U_j U_j')\,Y
\end{aligned}
$$

Similar to OLS, but projecting Y on the first K columns of U rather than on all M columns.
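The three fitted-value formulas can be compared directly in a short numpy sketch (simulated data; the dimensions, K, and 𝜆 are arbitrary choices): OLS weights every column of U equally, ridge shrinks column $j$ by $\lambda_j/(\lambda_j + \lambda)$, and PCA keeps only the first K columns:

```python
import numpy as np

rng = np.random.default_rng(2)
T, M, K, lam = 100, 6, 2, 5.0
X = rng.standard_normal((T, M))
Y = rng.standard_normal(T)

U, sv, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(sv) V'
evals = sv**2                                       # eigenvalues of X'X

y_ols   = U @ (U.T @ Y)                                      # unit weights
y_ridge = U @ (np.diag(evals / (evals + lam)) @ (U.T @ Y))   # shrunk weights
y_pca   = U[:, :K] @ (U[:, :K].T @ Y)                        # first K columns only

# Sanity check against the closed-form ridge fitted values
beta = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ Y)
assert np.allclose(y_ridge, X @ beta)
```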

A later section in the notes makes the association between PCA and the autoencoder.

In brief, PCA is a linear way to extract latent factors, while autoencoder is a non-linear factor extraction scheme.


IPCA

IPCA stands for Instrumented Principal Component Analysis, per Kelly, Pruitt, and Su (2017, 2019).

Whereas PCA extracts latent factors and loadings based on the assumption of constant loadings, IPCA allows loadings to vary with firm characteristics.

The factor model for excess returns is formulated as

$$r_{i,t+1} = \beta_{i,t}'f_{t+1} + \epsilon_{i,t+1}, \qquad \beta_{i,t}' = x_{i,t}'\Gamma_\beta$$

where $f_{t+1}$ is a K-vector of latent factors.

In the original paper, there is a residual in the 𝛽 equation to make beta random.

The loadings depend on observable asset characteristics contained in the $M \times 1$ vector $x_{i,t}$ (the first element is one, for the intercept), while $\Gamma_\beta$ is an $M \times K$ matrix.

Motivation: Gomes, Kogan, and Zhang (2003) formulate an equilibrium model where beta varies with firm-level predictors, such as size and book-to-market.

Avramov and Chordia (2006) show empirically that conditional beta that varies with characteristics improves model pricing abilities.

Characteristics can simply reflect covariances, or risk sources.


IPCA

Rewriting the model in vector form:

$$r_{t+1} = X_t\Gamma_\beta f_{t+1} + \epsilon_{t+1}$$

where $r_{t+1}$ is an $N \times 1$ vector of excess returns, $X_t$ is an $N \times M$ matrix of the characteristics, and $\epsilon_{t+1}$ is an $N \times 1$ vector of residuals.

The estimation objective is to minimize

$$\min_{\Gamma_\beta,\, F}\ \sum_{t=1}^{T-1}\big(r_{t+1} - X_t\Gamma_\beta f_{t+1}\big)'\big(r_{t+1} - X_t\Gamma_\beta f_{t+1}\big)$$

From the first-order conditions, we get that for $t = 1, 2, \dots, T-1$

$$\hat f_{t+1} = \big(\Gamma_\beta' X_t' X_t\Gamma_\beta\big)^{-1}\Gamma_\beta' X_t' r_{t+1}$$

and

$$\operatorname{vec}(\hat\Gamma_\beta') = \Big[\sum_{t=1}^{T-1} X_t'X_t \otimes \hat f_{t+1}\hat f_{t+1}'\Big]^{-1}\Big[\sum_{t=1}^{T-1}\big(X_t \otimes \hat f_{t+1}'\big)' r_{t+1}\Big]$$
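The two first-order conditions suggest an alternating least squares scheme: given $\Gamma_\beta$, each $\hat f_{t+1}$ is a cross-sectional regression, and given the factors, $\Gamma_\beta$ is updated through the vec/Kronecker formula. A minimal numpy sketch on a simulated balanced panel (the dimensions, K, and the iteration count are arbitrary choices; the solution is identified only up to a rotation of factors and loadings):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K, T = 50, 8, 2, 60
X = rng.standard_normal((T, N, M))   # characteristics X_t (hypothetical panel)
R = rng.standard_normal((T, N))      # excess returns r_{t+1}, aligned with X_t

Gamma = 0.1 * rng.standard_normal((M, K))   # initial guess for Gamma_beta
for _ in range(20):                          # alternate between the two FOCs
    # Factor step: for each date, a cross-sectional regression given Gamma
    F = np.stack([np.linalg.solve(Gamma.T @ X[t].T @ X[t] @ Gamma,
                                  Gamma.T @ X[t].T @ R[t]) for t in range(T)])
    # Gamma step: pooled update via the vec/Kronecker formula above
    A = sum(np.kron(X[t].T @ X[t], np.outer(F[t], F[t])) for t in range(T))
    b = sum(np.kron(X[t], F[t][None, :]).T @ R[t] for t in range(T))
    Gamma = np.linalg.solve(A, b).reshape(M, K)
```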


IPCA

Estimation can be carried out through a Gibbs sampling (MCMC) method.

Kelly, Pruitt, and Su (2017) propose ways of solving the system, and they also give an interesting managed-portfolio-based interpretation of the problem.

In a static latent factor model, stock returns are formulated as

$$r_t = \beta f_t + \epsilon_t$$

and the PCA factor solution is

$$\hat f_t = (\beta'\beta)^{-1}\beta' r_t$$

IPCA is analogous, but it accounts for dynamic, instrumented betas.


IPCA

One can also allow for mispricing, where the intercepts in the unrestricted factor model vary with the same firm characteristics.

Then the model is formulated as

$$r_{i,t+1} = x_{i,t}'\Gamma_\alpha + x_{i,t}'\Gamma_\beta f_{t+1} + \epsilon_{i,t+1}$$

where $\Gamma_\alpha$ is an $M \times 1$ vector.

Let $\tilde\Gamma = [\Gamma_\alpha, \Gamma_\beta]$ and $\tilde f_{t+1} = (1, f_{t+1}')'$.

Then, the model can be rewritten in matrix form:

$$r_{t+1} = X_t\tilde\Gamma\tilde f_{t+1} + \epsilon_{t+1}$$

From the first-order minimization conditions, we get for $t = 1, 2, \dots, T-1$

$$\hat f_{t+1} = \big(\Gamma_\beta' X_t' X_t\Gamma_\beta\big)^{-1}\Gamma_\beta' X_t'\big(r_{t+1} - X_t\Gamma_\alpha\big)$$

and

$$\operatorname{vec}(\hat{\tilde\Gamma}') = \Big[\sum_{t=1}^{T-1} X_t'X_t \otimes \hat{\tilde f}_{t+1}\hat{\tilde f}_{t+1}'\Big]^{-1}\Big[\sum_{t=1}^{T-1}\big(X_t \otimes \hat{\tilde f}_{t+1}'\big)' r_{t+1}\Big]$$


IPCA

Kelly, Pruitt, and Su (2017) use a sample from July 1962 to May 2014 that consists of 12,813 firms with 36 characteristics.

They measure $R^2$ both in-sample and out-of-sample for the restricted and unrestricted models, with the number of latent factors ranging from 1 to 6.

When there are five or more latent factors, the hypothesis $\Gamma_\alpha = 0$ cannot be rejected, while the performance of the restricted and unrestricted models is quite similar at the individual stock level.


Lasso

Next, we consider various Lasso (least absolute shrinkage and selection operator) models.

Tibshirani (1996) was the first to introduce Lasso.

Lasso simultaneously performs variable selection and coefficient estimation via shrinkage.

While the ridge regression implements an $\ell_2$ penalty, Lasso is an $\ell_1$ optimization:

$$\min_\beta\ (Y - X\beta)'(Y - X\beta) \quad \text{s.t.} \quad \sum_{j=1}^{M}|\beta_j| \le c$$

The 𝑙1 penalization approach is called basis pursuit in signal processing.

We again have a non-negative tuning parameter 𝜆 that controls the amount of regularization:

$$\mathcal{L}(\beta) = (Y - X\beta)'(Y - X\beta) + \lambda\sum_{j=1}^{M}|\beta_j|$$

Both Ridge and Lasso have solutions even when $X'X$ may not be of full rank (e.g., when there are more explanatory variables than time-series observations) or is ill-conditioned.


Lasso

Unlike Ridge, the Lasso coefficients cannot be expressed in closed form.

However, Lasso provides a set of sparse solutions.

This improves the interpretability of regression models.

Large enough 𝜆, or small enough 𝑐, will set some coefficients exactly to zero.

To understand why, notice that LASSO can be cast as imposing a Laplace prior on 𝛽:

$$P(\beta \mid \lambda) \propto \frac{\lambda}{2\sigma}\exp\!\Big(-\frac{\lambda|\beta|}{\sigma}\Big)$$

In particular, Lasso obtains by combining a Laplace prior with a normal likelihood.

Like the normal distribution, the Laplace is symmetric.

Unlike the normal distribution, the Laplace has a spike at zero (its first derivative is discontinuous there), and it concentrates its probability mass closer to zero than does the normal distribution.

This could explain why Lasso (Laplace prior) sets some coefficients to zero, while Ridge (normal prior) does not.


Lasso

Bayesian information criterion (BIC) is often used to pick 𝜆.

BIC (like other model selection criteria) is composed of two components: the sum of squared regression errors and a penalty factor that gets larger as the number of retained characteristics increases:

$$BIC = T \times \log\!\Big(\frac{RSS}{T}\Big) + l \times \log(T)$$

where $l$ is the number of retained characteristics.

Notice that different values of 𝜆 affect the optimization in such a way that a different set of characteristics is selected.

You choose 𝜆 as follows: initiate a range of values, compute BIC for each value, and pick the one that minimizes BIC.

The next page provides steps for assessing the maximum value of 𝜆 in formulating the range.
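A minimal scikit-learn sketch of the procedure (simulated data; the grid of candidate values is an arbitrary choice, and scikit-learn's alpha plays the role of 𝜆 up to scaling):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
T, M = 200, 30
X = rng.standard_normal((T, M))
Y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(T)   # sparse true model

best = None
for lam in np.logspace(-3, 0, 25):                   # candidate grid for lambda
    beta = Lasso(alpha=lam, fit_intercept=False).fit(X, Y).coef_
    rss = np.sum((Y - X @ beta) ** 2)
    l = np.sum(beta != 0)                            # number of retained characteristics
    bic = T * np.log(rss / T) + l * np.log(T)
    if best is None or bic < best[0]:
        best = (bic, lam, l)
print(f"BIC-minimizing lambda = {best[1]:.4f}, retaining {best[2]} characteristics")
```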


Lasso

For convenience, let us restate the objective function:

$$\mathcal{L}(\beta) = (Y - X\beta)'(Y - X\beta) + \lambda\sum_{j=1}^{M}|\beta_j|$$

If no change of 𝛽 decreases the objective function $\mathcal{L}(\beta)$, then 𝛽 is a local minimum.

An infinitesimal change $\partial\beta_j$ in $\beta_j$ would change the objective function $\mathcal{L}(\beta)$ as follows:

The penalty term changes by $\lambda\,\mathrm{sign}(\beta_j)\,\partial\beta_j$.

The squared-error term changes by $\partial RSS = -2X_j'(Y - X\beta)\,\partial\beta_j$, where $X_j$ is the $j$-th column of the matrix $X$.

If $\beta = 0$, then $\partial RSS = -2X_j'Y\,\partial\beta_j = -2Y'X_j\,\partial\beta_j$.

For the objective function to decrease, the change in the RSS should exceed the change in the penalty:

$$\frac{\big|2Y'X_j\,\partial\beta_j\big|}{\big|\lambda\,\mathrm{sign}(\beta_j)\,\partial\beta_j\big|} > 1, \quad \text{hence} \quad \lambda < \big|2Y'X_j\big|$$

Therefore, the largest 𝜆 worth including in the range is

$$\lambda_{\max} = \max_j\ \big|2Y'X_j\big|$$
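A one-line numpy check of this bound (simulated data; any X of dimension T×M and Y of length T would do):

```python
import numpy as np

rng = np.random.default_rng(5)
T, M = 200, 30
X = rng.standard_normal((T, M))
Y = rng.standard_normal(T)

# Smallest lambda at which all LASSO coefficients are exactly zero
lam_max = np.max(np.abs(2 * Y @ X))   # max_j |2 Y'X_j|, X_j = j-th column of X
print(lam_max)
```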


Adaptive Lasso

LASSO forces coefficients to be equally penalized.

One modification is to assign different weights to different coefficients:

$$\mathcal{L}(\beta) = (Y - X\beta)'(Y - X\beta) + \lambda\sum_{j=1}^{M} w_j|\beta_j|$$

It can be shown that if the weights are data-driven and are chosen in the right way, such a weighted LASSO can have the so-called oracle properties even when the LASSO does not. This is the adaptive LASSO.

For instance, $w_j$ can be chosen equal to one divided by the absolute value of the corresponding OLS coefficient raised to the power of $\gamma > 0$. That is, $w_j = \frac{1}{|\hat\beta_j|^{\gamma}}$ for $j = 1, \dots, M$.

The adaptive LASSO estimates are obtained by minimizing

$$\mathcal{L}(\beta) = (Y - X\beta)'(Y - X\beta) + \lambda\sum_{j=1}^{M} w_j|\beta_j|$$

Hyper-parameters 𝜆 and 𝛾 can be chosen using model selection criteria.

The adaptive LASSO is a convex optimization problem and thus does not suffer from multiple local minima.

Later, I describe how adaptive LASSO has been implemented in asset pricing through an RFS paper.
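One common way to implement the adaptive LASSO is to rescale each column of X by $1/w_j$, run a plain LASSO, and rescale the coefficients back; the weighted penalty then equals the plain penalty on the rescaled problem. A minimal scikit-learn sketch (simulated data; the 𝛾 and 𝜆 values are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
T, M, gamma, lam = 200, 10, 1.0, 0.05
X = rng.standard_normal((T, M))
Y = 1.5 * X[:, 0] + rng.standard_normal(T)

beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
w = 1.0 / np.abs(beta_ols) ** gamma            # adaptive weights w_j

X_scaled = X / w                               # column j scaled by 1/w_j
beta_scaled = Lasso(alpha=lam, fit_intercept=False).fit(X_scaled, Y).coef_
beta_adaptive = beta_scaled / w                # map back to the original scale
```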


Bridge regression

Frank and Friedman (1993) introduce the bridge regression.

This specification generalizes to an $\ell_q$ penalty.

The optimization is given by

$$\mathcal{L}(\beta) = (Y - X\beta)'(Y - X\beta) + \lambda\sum_{j=1}^{M}|\beta_j|^q$$

Notice that $q = 0, 1, 2$ correspond to OLS, LASSO, and ridge, respectively.

Moreover, the optimization is convex for $q \ge 1$, and the solution is sparse for $0 \le q \le 1$.

Ultimately, $q$ is a hyper-parameter that has to be selected.


The Elastic Net

The elastic net is yet another regularization and variable selection method.

Zou and Hastie (2005) describe it as a stretchable fishing net that retains “all big fish.”

Using simulations, they show that it often outperforms Lasso in terms of prediction accuracy.

The elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together.

The elastic net is particularly useful when the number of predictors is much bigger than the number of observations.

The naïve version of the elastic net is formulated through

$$\mathcal{L}(\beta) = (Y - X\beta)'(Y - X\beta) + \lambda_1\sum_{j=1}^{M}|\beta_j| + \lambda_2\sum_{j=1}^{M}\beta_j^2$$

Thus, the elastic net combines 𝑙1 and 𝑙2 norm penalties.

It still produces sparse representations.
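For reference, a minimal scikit-learn sketch (simulated data; note that scikit-learn parameterizes the two penalties through alpha and l1_ratio rather than 𝜆1 and 𝜆2 directly):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 50))
Y = X[:, 0] + X[:, 1] + rng.standard_normal(200)

# alpha scales the total penalty; l1_ratio splits it between the l1 and l2 parts
model = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, Y)
print(np.sum(model.coef_ != 0), "nonzero coefficients")
```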


Group Lasso

Suppose that the 𝑀 predictors are divided into 𝐿 groups, with 𝑀𝑙 denoting the number of predictors in group 𝑙.

𝑋𝑙 represents the predictors corresponding to the 𝑙-th group, while 𝛽𝑙 is the corresponding coefficient vector.

Notice, $\beta = [\beta_1', \beta_2', \dots, \beta_L']'$.

Assume that both 𝑌 and 𝑋 have been centered (no intercept).

The group Lasso solves the convex optimization problem

$$\mathcal{L}(\beta) = \Big(Y - \sum_{l=1}^{L}X_l\beta_l\Big)'\Big(Y - \sum_{l=1}^{L}X_l\beta_l\Big) + \lambda\sum_{l=1}^{L}\big(\beta_l'\beta_l\big)^{1/2}$$

The group Lasso does not yield sparsity within a group.

If a group of parameters is non-zero, they will all be non-zero.

The sparse group Lasso criterion does yield sparsity:

$$\mathcal{L}(\beta) = \Big(Y - \sum_{l=1}^{L}X_l\beta_l\Big)'\Big(Y - \sum_{l=1}^{L}X_l\beta_l\Big) + \lambda_1\sum_{l=1}^{L}\big(\beta_l'\beta_l\big)^{1/2} + \lambda_2\sum_{j=1}^{M}|\beta_j|$$


Non parametric non-linear shrinkage

Lasso, Adaptive Lasso, Group Lasso, Ridge, Bridge, and Elastic net are all linear or parametric approaches for shrinkage.

Group Lasso applies the same penalty to predictors belonging to a pre-specified group, while different penalties apply to different groups.

Some other parametric approaches (not covered here) include the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001) and Fan and Peng (2004) and the minimax concave penalty of Zhang (2010).

In many applications, however, there is little a priori justification for assuming that the effects of covariates take a linear form or belong to another known parametric family.

Huang, Horowitz, and Wei (2010) thus propose to use a nonparametric approach: the adaptive group Lasso for variable selection.

This approach is based on a spline approximation to the nonparametric components.

To achieve model selection consistency, they apply Lasso in two steps.

First, they use group Lasso to obtain an initial estimator and reduce the dimension of the problem.

Second, they use the adaptive group Lasso to select the final set of nonparametric components.

Non-parametric non-linear Models

In finance, Cochrane (2011) notes that portfolio sorts are the same thing as nonparametric cross-sectional regressions.

Drawing on Huang, Horowitz, and Wei (2010), Freyberger, Neuhierl, and Weber (2017) study this equivalence formally.

The cross section of stock returns is modelled as a non-linear function of firm characteristics:

$$r_{it} = m_t\big(C_{1,it-1}, \dots, C_{S,it-1}\big) + \epsilon_{it}$$

Notation:

𝑟𝑖𝑡 is the return on firm i at time t.

𝑚t is a function of S firm characteristics 𝐶1, 𝐶2, … , 𝐶𝑆.

Notice, 𝑚t itself is not stock specific but firm characteristics are.

Consider an additive model of the following form:

$$m_t(C_1, \dots, C_S) = \sum_{s=1}^{S} m_{t,s}(C_s)$$

As the additive model implies that $\frac{\partial^2 m_t(c_1, \dots, c_S)}{\partial c_s\,\partial c_{s'}} = 0$ for $s \ne s'$, there are no cross dependencies between characteristics.

Such dependencies can still be accommodated by producing additional predictors as interactions between characteristics.


Nonparametric Models

For each characteristic $s$, let $F_{s,t}(\cdot)$ be a strictly monotone function and let $F_{s,t}^{-1}(\cdot)$ denote its inverse.

Define $\tilde C_{s,it-1} = F_{s,t}(C_{s,it-1})$ such that $\tilde C_{s,it-1} \in [0,1]$.

That is, characteristics are monotonically mapped into the [0,1] interval.

An example of $F_{s,t}(\cdot)$ is the rank function: $F_{s,t}(C_{s,it-1}) = \frac{\mathrm{rank}(C_{s,it-1})}{N_t + 1}$, where $N_t$ is the total number of firms at time $t$.

The aim then is to find $m_t$ such that

$$m_t(C_1, \dots, C_S) = \tilde m_t\big(\tilde C_{1,it-1}, \dots, \tilde C_{S,it-1}\big)$$

In particular, to estimate the $\tilde m_t$ function, the normalized characteristic interval $[0,1]$ is divided into $L$ subintervals ($L+1$ knots): $0 = x_0 < x_1 < \dots < x_{L-1} < x_L = 1$.

To illustrate, consider the equal spacing case.

Then, $x_l = \frac{l}{L}$, and the intervals are $\tilde I_1 = [x_0, x_1)$, $\tilde I_l = [x_{l-1}, x_l)$ for $l = 2, \dots, L-1$, and $\tilde I_L = [x_{L-1}, x_L]$.

Nonparametric Models

Each firm characteristic is transformed into its corresponding interval.

Estimating the unknown function $m_{t,s}$ nonparametrically is done using quadratic splines.

The function $m_{t,s}$ is approximated by a quadratic function on each interval $\tilde I_l$.

The quadratic functions in each interval are chosen such that $m_{t,s}$ is continuous and differentiable on the whole interval $[0,1]$:

$$m_{t,s}(\tilde c) = \sum_{k=1}^{L+2}\beta_{tsk}\, p_k(\tilde c)$$

where $p_k(\tilde c)$ are basis functions and $\beta_{tsk}$ are estimated slopes. In particular, $p_1(y) = 1$, $p_2(y) = y$, $p_3(y) = y^2$, and $p_k(y) = \max(y - x_{k-3},\, 0)^2$ for $k = 4, \dots, L+2$.

In that way, you can get a continuous and differentiable function.

To illustrate, consider the case of two characteristics, e.g., size and book to market (BM), and 3 intervals.

Then, the $m_t$ function is:

$$
\begin{aligned}
m_t\big(\tilde c_{i,size},\ \tilde c_{i,BM}\big) ={}& \beta_{t,size,1}\cdot 1 + \beta_{t,size,2}\,\tilde c_{i,size} + \beta_{t,size,3}\,\tilde c_{i,size}^2 \\
&+ \beta_{t,size,4}\max\big(\tilde c_{i,size} - 1/3,\ 0\big)^2 + \beta_{t,size,5}\max\big(\tilde c_{i,size} - 2/3,\ 0\big)^2 \\
&+ \beta_{t,BM,1}\cdot 1 + \beta_{t,BM,2}\,\tilde c_{i,BM} + \beta_{t,BM,3}\,\tilde c_{i,BM}^2 \\
&+ \beta_{t,BM,4}\max\big(\tilde c_{i,BM} - 1/3,\ 0\big)^2 + \beta_{t,BM,5}\max\big(\tilde c_{i,BM} - 2/3,\ 0\big)^2
\end{aligned}
$$
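A minimal numpy sketch of the basis functions $p_k$ for L = 3 equally spaced intervals (knots at 0, 1/3, 2/3, 1), matching the size/BM illustration above:

```python
import numpy as np

def quadratic_spline_basis(c, L=3):
    """Basis p_1..p_{L+2} for a quadratic spline with equally spaced knots on [0,1]."""
    knots = np.arange(1, L) / L                            # interior knots x_1, ..., x_{L-1}
    cols = [np.ones_like(c), c, c**2]                      # p_1 = 1, p_2 = y, p_3 = y^2
    cols += [np.maximum(c - x, 0.0) ** 2 for x in knots]   # p_k = max(y - x_{k-3}, 0)^2
    return np.column_stack(cols)                           # shape: (n_obs, L + 2)

c_tilde = np.linspace(0, 1, 5)                             # normalized characteristic values
print(quadratic_spline_basis(c_tilde))
```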

Adaptive group Lasso

The estimation of 𝑚𝑡 is done in two steps:

First step, estimate the slope coefficients 𝑏𝑠𝑘 using the group Lasso routine:

$$\hat\beta_t = \operatorname*{argmin}_{b_{sk}:\ s=1,\dots,S;\ k=1,\dots,L+2}\ \sum_{i=1}^{N_t}\Big(r_{it} - \sum_{s=1}^{S}\sum_{k=1}^{L+2} b_{sk}\, p_k\big(\tilde C_{s,it-1}\big)\Big)^2 + \lambda_1\sum_{s=1}^{S}\Big(\sum_{k=1}^{L+2} b_{sk}^2\Big)^{1/2}$$

Altogether, the number of $b_{sk}$ coefficients is $S \times (L+2)$.

The second expression is a penalty term applied to the spline expansion.

𝜆1 is chosen such that it minimizes the Bayesian Information Criterion (BIC).

The essence of group Lasso is either to include or exclude all L+2 spline terms associated with a given characteristic.

While this optimization yields a sparse solution, many characteristics are still retained.

To include only characteristics with strong predictive power, the adaptive group Lasso is then employed.

Adaptive group Lasso

To implement adaptive group Lasso, define the following weights using estimates for 𝑏𝑠𝑘 from the first step:

$$w_{ts} = \begin{cases} \Big(\sum_{k=1}^{L+2}\tilde b_{sk}^2\Big)^{-1/2}, & \text{if } \sum_{k=1}^{L+2}\tilde b_{sk}^2 \ne 0 \\[4pt] \infty, & \text{if } \sum_{k=1}^{L+2}\tilde b_{sk}^2 = 0 \end{cases}$$

Then, estimate again the coefficients $b_{sk}$ using the above-estimated weights $w_{ts}$:

$$\hat\beta_t = \operatorname*{argmin}_{b_{sk}:\ s=1,\dots,S;\ k=1,\dots,L+2}\ \sum_{i=1}^{N_t}\Big(r_{it} - \sum_{s=1}^{S}\sum_{k=1}^{L+2} b_{sk}\, p_k\big(\tilde C_{s,it-1}\big)\Big)^2 + \lambda_2\sum_{s=1}^{S} w_{ts}\Big(\sum_{k=1}^{L+2} b_{sk}^2\Big)^{1/2}$$

𝜆2 is chosen such that it minimizes BIC.

The weights $w_{ts}$ guarantee that we do not select any characteristic in the second step that was not selected in the first step.


Regression Trees

Regression trees are also a nonparametric approach for incorporating interactions among predictors.

At each step, a new branch sorts the data from the preceding step into two bins based on one of the predictive variables.

Let us denote the data set from the preceding step as 𝐶 and the two new bins as 𝐶𝑙𝑒𝑓𝑡 and 𝐶𝑟𝑖𝑔ℎ𝑡.

Let us denote the number of elements in 𝐶, 𝐶𝑙𝑒𝑓𝑡 , 𝐶𝑟𝑖𝑔ℎ𝑡 by 𝑁,𝑁𝑙𝑒𝑓𝑡 , 𝑁𝑟𝑖𝑔ℎ𝑡 , respectively.

The specific predictor variable and its threshold value are chosen to minimize the sum of squared forecast errors

$$\mathcal{L}\big(C, C_{left}, C_{right}\big) = H\big(\theta_{left}, C_{left}\big) + H\big(\theta_{right}, C_{right}\big)$$

where $H(\theta, X) = \sum_{z_{i,t}\in X}\big(r_{i,t+1} - \theta\big)^2$.

We should also sum through the time-series dimension. I skip it to ease notation.

We compute ℒ 𝐶, 𝐶𝑙𝑒𝑓𝑡 , 𝐶𝑟𝑖𝑔ℎ𝑡 for each predictor and choose the one with the minimum loss.


Regression Trees

The predicted return is the average of returns of all stocks within the group:

$$\theta_{left} = \frac{1}{N_{left}}\sum_{z_{i,t}\in C_{left}} r_{i,t+1}\,; \qquad \theta_{right} = \frac{1}{N_{right}}\sum_{z_{i,t}\in C_{right}} r_{i,t+1}$$

The division into lower leaves terminates when the number of leaves or the depth of the tree reaches a pre-specified threshold.

The prediction of a tree with $K$ leaves (terminal nodes) and depth $L$ can be written as

$$g\big(z_{i,t};\theta,K,L\big) = \sum_{k=1}^{K}\theta_k\, 1\{z_{i,t}\in C_k(L)\}$$

where $C_k(L)$ is one of the $K$ partitions of the data.

In essence, a stock belongs to a group, and the stock's predicted return is equal to the group average.
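A minimal numpy sketch of one greedy split: for each predictor and each candidate threshold, compute $H(\theta_{left}, C_{left}) + H(\theta_{right}, C_{right})$ and keep the best pair (simulated data; the data-generating process is an arbitrary choice):

```python
import numpy as np

def best_split(Z, r):
    """Z: (n_obs, n_predictors) characteristics; r: next-period returns."""
    best = (np.inf, None, None)
    for j in range(Z.shape[1]):
        for thresh in np.unique(Z[:, j])[:-1]:         # candidate thresholds
            left = Z[:, j] <= thresh
            loss = np.sum((r[left] - r[left].mean()) ** 2) + \
                   np.sum((r[~left] - r[~left].mean()) ** 2)
            if loss < best[0]:
                best = (loss, j, thresh)
    return best                                        # (loss, predictor, threshold)

rng = np.random.default_rng(8)
Z = rng.uniform(size=(100, 3))
r = np.where(Z[:, 1] > 0.5, 0.02, -0.01) + 0.01 * rng.standard_normal(100)
print(best_split(Z, r))                                # should pick predictor 1
```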


Regression Trees - Example (figure not reproduced in this transcript)


Regression Trees

Pros

The tree model is invariant to monotonic transformations of the predictors.

It can approximate nonlinearities.

A tree of depth 𝐿 can capture, at most, L-1 interactions.

Cons

Trees are prone to over-fitting and are therefore not used without regularization.


Random Forest

Random forest is an ensemble method that combines forecasts from many different shallow trees.

For each tree, a subset of predictors is drawn at each potential branch split.

This lowers the correlation among different trees and improves the prediction.

The depth, $L$, of the trees is a tuning (hyper-) parameter and is optimized in the validation stage.
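A minimal scikit-learn sketch (simulated data; here max_depth plays the role of L, and max_features controls the subset of predictors drawn at each split):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(9)
Z = rng.uniform(size=(500, 10))                          # characteristics
r = Z[:, 0] * Z[:, 1] + 0.1 * rng.standard_normal(500)   # returns with an interaction

forest = RandomForestRegressor(n_estimators=300, max_depth=3,
                               max_features="sqrt", random_state=0).fit(Z, r)
print(forest.predict(Z[:5]))
```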


Neural Networks

A nonlinear feed-forward method.

The network consists of an input layer, one or more hidden layers, and an output layer.

Analogous to axons in a biological brain, layers of the network represent groups of neurons, with each layer connected by synapses that transmit signals among neurons of different layers.

Deep learning reflects the notion that the number of hidden layers is large.

In finance, two or more hidden layers are perceived to be large enough.

44

Page 45: Machine learning Methods in asset pricingpluto.huji.ac.il/~davramov/Machine_learning.pdf · Machine learning prescribes a vast collection of high-dimensional models that attempt to

Neural Networks

(Figure: a feed-forward network with an input layer, two hidden layers, and an output layer.)

Neural Networks

Each neuron applies a nonlinear “activation function” $f$ to its aggregated signal before sending its output to the next layer:

$$x_k^l = f\Big(\theta_0 + \sum_j z_j\theta_j\Big)$$

where $x_k^l$ is neuron $k \in \{1, 2, \dots, K_l\}$ in hidden layer $l \in \{1, 2, \dots, L\}$.

The function $f$ is usually one of the following:

Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$

Hyperbolic tangent: $\tanh(x) = 2\sigma(2x) - 1$

ReLU: $\mathrm{ReLU}(x) = \begin{cases} 0, & \text{if } x < 0 \\ x, & \text{otherwise} \end{cases}$

The activation function is operationalized in each neuron, excluding the output layer.

A linear activation function boils down to OLS.

It is often confusing: the neural network only captures non-linearities; it does not capture interactions between characteristics. This could be shown by the formulation on the next page.


Neural Networks

For the ReLU activation function, we can rewrite the neural network function as:

$$\text{Fitted Value} = \max\Big(\max\big(\cdots\max\big(XW^{hl_1},\, 0\big)W^{hl_2},\, 0\big)\cdots W^{hl_n},\, 0\Big)W^{\text{output}}$$

where $X$ is the input, $W^{hl_i}$ is the weight matrix of the neurons in hidden layer $i \in \{1, \dots, n\}$, $n$ is the number of hidden layers, and $W^{\text{output}}$ are the weights of the output layer.

Then, run an optimization that minimizes the sum of squared errors, just like OLS.

One should also include a LASSO routine to zero out some coefficients (which does not necessarily translate into zeroing out some characteristics).

If the activation function is linear, simply ignore the max operator in the above equation. Then the fitted value is $XW$, boiling down to OLS.
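A minimal numpy sketch of the fitted-value formula above, with random weights for illustration (in practice, the W's are chosen by minimizing the sum of squared errors):

```python
import numpy as np

def fitted_value(X, hidden_weights, W_output):
    """max(...max(max(X W_1, 0) W_2, 0)... W_n, 0) W_output, with ReLU between layers."""
    a = X
    for W in hidden_weights:
        a = np.maximum(a @ W, 0.0)         # ReLU; drop the max for a linear network
    return a @ W_output                    # linear output layer (no activation)

rng = np.random.default_rng(10)
X = rng.standard_normal((4, 2))            # 4 stocks, 2 characteristics
Ws = [rng.standard_normal((2, 3)), rng.standard_normal((3, 3))]
print(fitted_value(X, Ws, rng.standard_normal(3)))
```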

A simple example

Two inputs: size and BM (book to market)

One hidden layer with three neurons: A, B, and C

W’s are the weights and b’s are the intercepts (biases).

$$
\begin{aligned}
input_A &= size \times W_{size}^A + BM \times W_{BM}^A + b_A, & output_A &= \max(input_A,\, 0) \\
input_B &= size \times W_{size}^B + BM \times W_{BM}^B + b_B, & output_B &= \max(input_B,\, 0) \\
input_C &= size \times W_{size}^C + BM \times W_{BM}^C + b_C, & output_C &= \max(input_C,\, 0)
\end{aligned}
$$

Output layer (ol):

$$output = output_A \times W_A^{ol} + output_B \times W_B^{ol} + output_C \times W_C^{ol} + b_{ol}$$

The output is the predicted return.

You implement that procedure for each stock, while the W's are the same across stocks.

To find the W's, you minimize the sum of squared errors (data versus output), aggregating across all stocks and all months in the training sample.

Account for LASSO to zero out some of the weights (not necessarily some of the characteristics).

(Figure: the two inputs, size and BM, feed the hidden-layer neurons A, B, and C, which feed the output neuron ol.)


Neural network with classification

Borisenko (2019) formulates return prediction through a classification problem.

Whether the next-month return is above or below the median cross-sectional return is described by:

$$y_{i,t+1} = \begin{cases} 1, & \text{if } r_{i,t+1} > \mathrm{median}(r_{t+1}) \\ 0, & \text{otherwise} \end{cases}$$

Let $\hat p^a_{i,t+1}$ be the predicted probability of stock $i$'s return being above the median of the cross-sectional return distribution at time $t+1$.

A binary classifier models that probability using the sigmoid function

$$\hat p^a_{i,t+1} = f\big(X_{i,t+1};\, \hat w\big) = \frac{1}{1 + e^{-h(X_{i,t+1};\,\hat w)}}$$

where $h(X_{i,t+1}; \hat w)$ is the value of the output layer; as in a neural network, it is a nonlinear function of $X_{i,t+1}$.

The loss function is formulated as

$$L(w) = -\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\Big[y_{i,t+1}\log\hat p^a_{i,t+1} + \big(1 - y_{i,t+1}\big)\log\big(1 - \hat p^a_{i,t+1}\big)\Big]$$
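A minimal numpy sketch of the sigmoid link and the cross-entropy loss (hypothetical labels y and output-layer values h):

```python
import numpy as np

def cross_entropy(y, h):
    p = 1.0 / (1.0 + np.exp(-h))           # sigmoid of the output layer
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])                 # 1 if return above the cross-sectional median
h = np.array([2.0, -1.0, 0.5, -0.2])       # hypothetical output-layer values
print(cross_entropy(y, h))
```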


Autoencoding: unconditional asset pricing

To begin with, consider PCA.

PCA is a statistical routine in which the output variables (K<N common factors) attempt to approximate the input variables (N stock returns) through a linear mechanism of dimension reduction.

An autoencoder is also a dimension-reduction device, but through non-linear (more specifically, neural network based) compression of the input variables.

Thus, the objective is to extract common factors in ways that are more general than PCA.

With a linear activation function, the autoencoder collapses to PCA.

Otherwise (e.g., with ReLU activation function), common factors are extracted nonlinearly.

The autoencoder encapsulates two steps: encoding and decoding.

Encoding: the input variables pass through a small number of neurons in one or more hidden layers forging a compressed representation of the input.

Decoding: the compressed representation is then unpacked and mapped into the output layer.

The figure on page 53 is telling the story, focusing on a single hidden layer.

The green circles represent the input variables, the purple circles represent the fully connected hidden nodes, and the red circles represent the output variables.


Unconditional Autoencoding

The transition from the green circles to the purple circles establishes the encoding part.

The transition from the purple circles to the red circles represents decoding.

The encoding part produces the lower dimension set of latent factors.

The decoding part reconstructs returns from the latent factors.

In this particular example, there are three latent factors (K=3).

The setup is unconditional because factor loadings are fixed over time.

Model coefficients are estimated through minimizing the sum of squared errors between actual returns and reconstructed returns.

Both autoencoders and PCA are unsupervised, as they attempt to model the full panel of asset returns using only the returns themselves as inputs.

Recall, the previous implementation of neural networks considers a non-linear regression of returns on predictive characteristics; thus, learning is supervised.
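A minimal Keras sketch of an unconditional autoencoder on a panel of N returns (simulated data; K, the layer sizes, and the activation are arbitrary choices; with a linear activation, the fit collapses toward PCA):

```python
import numpy as np
from tensorflow import keras

T, N, K = 500, 25, 3
rng = np.random.default_rng(11)
R = rng.standard_normal((T, K)) @ rng.standard_normal((K, N)) \
    + 0.1 * rng.standard_normal((T, N))        # returns driven by K latent factors

model = keras.Sequential([
    keras.Input(shape=(N,)),
    keras.layers.Dense(K, activation="relu"),    # encoder: compress N returns to K factors
    keras.layers.Dense(N, activation="linear"),  # decoder: reconstruct the N returns
])
model.compile(optimizer="adam", loss="mse")      # squared reconstruction errors
model.fit(R, R, epochs=20, batch_size=32, verbose=0)   # unsupervised: returns in, returns out
```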


(Figure: the unconditional autoencoder described above.)

Autoencoding: Conditional Asset Pricing


Gu, Kelly, and Xiu (2019) implement autoencoding in a more complex setup where betas vary non-linearly with predictive characteristics.

Their conditional autoencoder extends the unconditional one since loadings are time varying rather than fixed.

The conditional autoencoder also extends IPCA in which loadings are merely a linear function of predictive characteristics.

The figure on page 58 of the slides details the complexity of the conditional autoencoder (CA).

The left side of the network models factor loadings as a nonlinear function of predictive characteristics, while the right-side network formulates factors as portfolios of individual stock returns.

Let us start with the left side.

The yellow layer corresponds to a panel of predictive characteristics: N stocks, P characteristics per stock.

Then characteristics are transformed through hidden layers to form the output variables.


Autoencoding: Conditional Asset Pricing


The output layer consists of K neurons per stock, standing for the K latent factors to be extracted below.

Thus, the output variables are time varying betas, at the stock level.

There are K betas for N stocks.

Moving to the right side, K factors are extracted from N returns.

Then, for each stock, we multiply betas with the corresponding factors. This forms the stock-level predicted return.

It should be noted that just as the unconditional autoencoder boils down to PCA when the activation function is linear, the CA boils down to IPCA when the activation function is linear.

Gu, Kelly, and Xiu (2019) consider a range of CA architectures with varying degrees of complexity.


Autoencoding: Conditional Asset Pricing


The simplest (CA0) uses a single linear layer in both the beta and factor networks, making it similar (but not identical) to IPCA.

Next, CA1 adds a hidden layer with 32 neurons in the beta network.

Finally, CA2 and CA3 add a second and third hidden layer, with 16 and 8 neurons respectively, to the beta side.

CA0 through CA3 all maintain a one-layer linear specification on the factor side of the model.

In these cases, the only variation in the factor specification is in the number of neurons, which ranges from 1 to 6 (corresponding to the number of factors in the model).


Autoencoders and asset pricing

In Gu, Kelly, and Xiu (2019), the general factor model is

$$r_{i,t} = \beta(z_{i,t-1})'f_t + u_{i,t}$$

As noted, the loading 𝛽 is a non-linear function of firm characteristics.

The function is implemented by a neural network where the inputs are the firm characteristics and the output has dimension K.

The number of non-linear hidden layers in the neural network varies from 0 to 3. Zero hidden layers is similar, but not identical, to IPCA.

The factors $f_t$ are implemented by another neural network where the inputs are “managed” portfolios:

$$x_t = (Z_{t-1}'Z_{t-1})^{-1}Z_{t-1}' r_t$$

$x_t$ is a $P \times 1$ vector of long-short portfolios based on the firms' characteristics.

The reason for using the managed portfolios instead of the individual stock returns is the unbalanced panel.

The output of the factor neural network is of dimension K.

In the paper they use one linear layer for the factor neural network.

There is a difference between the autoencoder and IPCA factor formation.

In IPCA, there is a constraint that relates the loadings and the factors, namely $\Gamma_\beta$. In the autoencoder they are independent.

(Figure: the conditional autoencoder architecture, with the beta network on the left and the factor network on the right.)

Autoencoders and asset pricing

The following slides compare several asset pricing models.

FF indicates observed factors like MKT, SMB, HML, CMA, RMW, and UMD

PCA - constant loadings

IPCA – the loadings vary linearly with characteristics

CA0−CA3: autoencoders with 0-3 hidden layers for estimating the loading function, as noted earlier.

The performance measures are:

$$R^2_{total} = 1 - \frac{\sum_{(i,t)\in OOS}\big(r_{i,t} - \hat\beta_{i,t-1}'\hat f_t\big)^2}{\sum_{(i,t)\in OOS} r_{i,t}^2}$$

$$R^2_{pred} = 1 - \frac{\sum_{(i,t)\in OOS}\big(r_{i,t} - \hat\beta_{i,t-1}'\hat\lambda_{t-1}\big)^2}{\sum_{(i,t)\in OOS} r_{i,t}^2}$$

where $\hat\lambda_{t-1}$ is the sample mean of $\hat f$ up to $t-1$.

(The comparison results appear in tables and figures on the original slides; they are not reproduced in this transcript.)

Generative Adversarial Network - GAN

GAN is a setup with two neural networks contesting with each other in an (often zero-sum) game.

For example, let 𝑤 and 𝑔 be two neural networks’ outputs.

The loss function is defined over both outputs, 𝐿 𝑤, 𝑔 .

The competition between the two neural networks is carried out by iterating on both $w$ and $g$ sequentially:

$w$ is updated by minimizing the loss while $g$ is given:

$$\hat w = \arg\min_w\ L(w \mid g)$$

$g$ is the adversary, and it is updated by maximizing the loss while $w$ is given:

$$\hat g = \arg\max_g\ L(g \mid w)$$


Adversarial GMM

Chen, Pelger, and Zhu (2019) employ an adversarial GMM to estimate the Stochastic Discount Factor (SDF).

Adversarial GMM uses a GAN to solve conditional moment conditions.

The model is formulated as follows:

For any excess return, the conditional expectation at time $t$ is

$$E_t\big[M_{t+1}R^e_{t+1,i}\big] = 0$$

where $M_{t+1}$ is the SDF and $R^e_{t+1,i}$ is security $i$'s excess return at time $t+1$. This equality should hold for any $i = 1, \dots, N$.

The SDF is of the form

$$M_{t+1} = 1 - \sum_{i=1}^{N} w_{t,i}R^e_{t+1,i}$$

where $w_{t,i}$ is security $i$'s weight, which is a function of firm $i$'s characteristics at time $t$, $I_{t,i}$.

In the adversarial GMM, the first neural network generates the function $w_{t,i}$ for each $i$.


Adversarial GMM

To switch from the conditional expectation to the unconditional expectation, we multiply the moment conditions by a function measurable with respect to time $t$:

$$E\big[M_{t+1}R^e_{t+1,i}\, g(I_{t,i})\big] = 0$$

This equality should hold for any function 𝑔.

The function 𝑔 is the second or adversarial neural network.

Each output of $g$ is, in fact, a moment condition.

In the adversarial approach, the moment conditions are those that lead to the largest mispricing:

$$\min_w\ \max_g\ \frac{1}{N}\sum_{j=1}^{N}\Bigg[\frac{1}{T}\sum_{t=1}^{T}\Big(1 - \sum_{i=1}^{N} w(I_{t,i})\,R^e_{t+1,i}\Big)R^e_{t+1,j}\, g(I_{t,j})\Bigg]^2$$

After convergence, we can construct time-series observations for the SDF using $w$ and excess returns.


LSTM

LSTM networks are well suited for making predictions based on time-series observations or, more generally, based on a sequence of data.

For instance, speech recognition, handwriting recognition, and time-series predictions are all based on a sequence of data.

The LSTM unit transforms a time-series input into a time-series output.

The dimensions of the inputs and outputs could be identical or different.

The architecture of the LSTM unit is composed of four parts: (i) a cell, (ii) an input gate, (iii) an output gate, and (iv) a forget gate.

A cell – the memory part of the LSTM unit (c).

Three regulators (“gates”) control the flow of information inside the LSTM unit:

Input gate (i)

Output gate (o)

Forget gate (f)

The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.

The regulators are trained to capture the important signals in the time series, which can appear with lags of unknown duration.

The cell has the ability to forget part of its previously stored memory and to add part of the new information.


LSTM

Here is a schematic figure for an LSTM unit (the figure is taken from https://colah.github.io/posts/2015-08-Understanding-LSTMs/). It shows the cell state flowing from $C_{t-1}$ to $C_t$, the output flowing from $h_{t-1}$ to $h_t$, and the gates $f_t$, $i_t$, $o_t$ together with the intermediate value $\tilde C_t$.

$x_t$ is the input, consisting of time-series observations.

$h_t$ is the output time series.

$i_t$, $o_t$, and $f_t$ are the regulators.

$\tilde C_t$ is an intermediate value for the memory (c) part.

At time $t$, the information set consists of $C_{t-1}$, $h_{t-1}$, $x_t$.


LSTM

In turn, LSTM is a serial processing procedure.

Later, we will introduce the attention mechanism, a parallel (versus sequential) processing routine with expanding information set involving all memory cells.

The functions are defined as follows:

$$
\begin{aligned}
f_t &= \sigma_g\big(W_f x_t + U_f h_{t-1} + b_f\big) \\
i_t &= \sigma_g\big(W_i x_t + U_i h_{t-1} + b_i\big) \\
\tilde C_t &= \tanh\big(W_c x_t + U_c h_{t-1} + b_c\big) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tilde C_t \\
o_t &= \sigma_g\big(W_o x_t + U_o h_{t-1} + b_o\big) \\
h_t &= o_t \circ \tanh(C_t)
\end{aligned}
$$

tanh 𝑥 =𝑒2𝑥−1

𝑒2𝑥+1is the Hyperbolic tangent (between -1 and 1), 𝜎𝑔 𝑥 =

1

1+𝑒−𝑥is the sigmoid function

(between 0 and 1), and ∘ is an element-by-element product. The tanh function could be replaced by a sigmoid.

Variables: 𝑥𝑡 ∈ ℝ𝑑is the intput vector, 𝑓𝑡 ∈ ℝℎ is the forget gate, 𝑖𝑡 ∈ ℝℎ is the input update gate, 𝑜𝑡∈ ℝℎ is the ouptut gate, ሚ𝐶𝑡 (𝑐𝑡) ∈ ℝℎ is the cell input (state) vector, 𝑊_𝑠 ∈ ℝℎ×𝑑 and 𝑈_𝑠 ∈ ℝℎ×𝑑 are weight matrices, and b_s ∈ ℝℎ are intercept vectors, all of which are learnt during the training stage, and d is the pre-specified number of input feature.
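As a concrete illustration, here is a minimal numpy sketch of a single LSTM step implementing the six equations above; the parameter names (`Wf`, `Uf`, `bf`, and so on) are illustrative assumptions, not a particular library's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step following the gate equations above (a sketch).

    x_t: (d,) input; h_prev, C_prev: (h,) previous hidden and cell states;
    p: dict with W_* (h x d), U_* (h x h) matrices and b_* (h,) intercepts.
    """
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])        # input gate
    C_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # cell input
    C = f * C_prev + i * C_tilde                                   # new cell state
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])        # output gate
    h = o * np.tanh(C)                                             # new hidden state
    return h, C

# toy usage: d = 3 input features, h = 2 hidden units
rng = np.random.default_rng(0)
d, h = 3, 2
p = {f"W{k}": rng.standard_normal((h, d)) for k in "fico"}
p.update({f"U{k}": rng.standard_normal((h, h)) for k in "fico"})
p.update({f"b{k}": np.zeros(h) for k in "fico"})
print(lstm_step(rng.standard_normal(d), np.zeros(h), np.zeros(h), p))
```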


LSTM

Larger $f_t$ means forget less, which amounts to giving higher weights to previous cell values.

Larger $i_t$ means giving higher weight to current cell values at the expense of previous cells.

The sum of $f_t$ and $i_t$ is not one, so they are not explicit weights, although they act like weights.

$h$ is the single hyper-parameter, standing for the number of hidden units and the output dimension.

Hyper-parameters in LSTM are determined through a validation sample in supervised learning (see the application on the next page).

We will also study reinforcement learning with as well as without LSTM. Because reinforcement learning is not supervised, there is no validation sample. Instead, hyper-parameters are determined in the test sample by experimenting with different values.


LSTM

$C_0$ and $h_0$ are initial values, which can be set to zero.

For example, let firm $i$ have $M$ characteristics at time $t$, denoted by $x_t^i$.

At time $t$, we use the most recent $K$ periods to predict the next-period return $\hat{r}_{t+1}^i$.

The input is the series $x_{t-K+1}^i, \dots, x_t^i$, while the final output, $h_t^i$, generates the prediction $\hat{r}_{t+1}^i$.

LSTM parameters are estimated by minimizing the loss function

$$\mathcal{L} = \sum_{i,t} \left( h_t^i - r_{t+1}^i \right)^2$$

In that setup, it is assumed that there is only one hidden unit; the output is a single number.

If $h_t^i$ is a vector of length $h$, the output is given by $\tilde{h}_t^i = W_h h_t^i$, and $\tilde{h}_t^i$ replaces $h_t^i$ in the loss function above. That only amounts to expanding the parameter space.
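Under these assumptions, the prediction for one firm can be sketched by unrolling the `lstm_step` function from the earlier sketch over the most recent $K$ periods; `W_h` is the (hypothetical) output weight vector mapping $h_t^i$ to a scalar forecast.

```python
import numpy as np  # lstm_step is the sketch defined above

def predict_and_loss(X, r_next, p, W_h):
    """Unroll the LSTM over K periods of one firm's characteristics,
    predict the next-period return, and return the squared error.

    X: (K, d) series x_{t-K+1}, ..., x_t; r_next: realized r_{t+1};
    W_h: (h,) output weights (the W_h of the text, an h-vector here).
    """
    h_state = C_state = np.zeros(W_h.shape[0])   # C_0 = h_0 = 0
    for x_t in X:                                # sequential processing
        h_state, C_state = lstm_step(x_t, h_state, C_state, p)
    r_hat = W_h @ h_state                        # prediction of r_{t+1}
    return r_hat, (r_hat - r_next) ** 2
```

Summing the squared errors over firms and dates reproduces the loss $\mathcal{L}$ above.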


Transformer Encoder in Finance

Treating financial assets as multiple sequences in studies of reinforcement learning was pioneered by Wang, Zhang, Tang, Wu, and Xiong (2019) and Cong, Tang, Wang, and Zhang (2020).

The authors of these papers coin a new term – the alpha portfolio.

Their implementation builds on Transformer.

So, what is Transformer?

Transformer is a deep learning model (multiple layers) based solely on attention.

Transformer is an encoder-decoder model developed by Google team for the purpose of natural language processing (NLP), such as translations of sentences.

For instance, the translation of German to English is done in two steps: Encoder transforms German to meta-language and the decoder translates the meta-language to English.

Previously, we discussed recurrent neural networks (RNN), such as LSTM, that are also designed to handle sequential data.

However, RNN differs from Transformer in many respects.

Transformer uses an attention mechanism: for each element in the input sequence, it measures that element's relevance to the other input elements.

Transformer can process the data in parallel, unlike RNN, in which the processing is sequential.

This parallelization allows Transformers to be trained more effectively on larger datasets.

Due to the parallel nature of the Transformer, each element in the input should be manipulated so that it includes an indicator of its location in the sequence.


Transformer Encoder in Finance

In their asset pricing implementation, Cong et al. use solely the encoder part of the Transformer.

The input is the time series of stock characteristics.

First, for each stock, its time-series characteristics are encoded.

Second, the encoded data are used to model the cross-sectional interaction between the stocks (CAAN).

The cross-sectional interactions are modeled by a self-attention mechanism; more on that soon.

The interactions are later translated into scores and weights for constructing the portfolio that would result in the highest ex-post Sharpe ratio.

The authors also implement a model using LSTM for the time-series encoding while keeping the cross-sectional interactions modeled with the attention mechanism (CAAN).

Using LSTM for the time-series encoding results in better performance.

Let us dive into the routines.


Transformer Encoder (TE) in Finance

The first part of the analysis consists of time-series encoding.

We start with firm $i$'s characteristics in the $K$ past months:

$$X^i = \left( x_1^i, \dots, x_k^i, \dots, x_K^i \right)$$

These observations are encoded by the transformer encoder into hidden states:

$$Z^i = \left( z_1^i, \dots, z_k^i, \dots, z_K^i \right)$$

It is a sequence-to-sequence encoding for each firm separately.

Later, we explain in detail the encoding mechanism – the transition from $X^i$ to $Z^i$.

The dimension of the input is $K = 12$, namely one year of historical data, while there are 51 firm characteristics in the original papers.

Therefore, $x_k^i$ is a vector of length $d = 51$.

Similarly, $z_k^i$ has the same dimension as the input.


Transformer Encoder (TE) in Finance

TE is a sequence of four operations, as described in the figure at the top of the slide:

1. Multi-head attention unit
2. Add & Norm, a regularization layer
3. Feed-forward network (FFN)
4. Add & Norm, a regularization layer

We start with the multi-head attention unit, as described in the bottom figure.

The multi-head attention unit is a parallel connection of four self-attention units.

The number of self-attention units in a multi-head attention unit is a hyper-parameter.¹

¹ In the original setup there are 8 self-attention units in a multi-head attention unit; see “Attention Is All You Need”, A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, arXiv:1706.03762.


Transformer – Self attention

The self attention module is often called Scaled Dot-Product attention.

Its aim is to detect interrelations between the sequence elements.

It performs several mathematical operations, summarized in the figure on this page and detailed in the following.

Let the input sequence for firm $i$ be denoted by the set of vectors $X^i = \left( x_1^i, \dots, x_k^i, \dots, x_K^i \right)$.

The self-attention unit constructs for each vector $x_k^i$ three representation vectors: a query $q_k^i$, a key $k_k^i$, and a value $v_k^i$.

These vectors are constructed using the trainable weight matrices $W^Q$, $W^K$, $W^V$:

$$q_k^i = W^Q x_k^i, \qquad k_k^i = W^K x_k^i, \qquad v_k^i = W^V x_k^i$$

The lengths of $q_k^i$ and $k_k^i$ must be equal; their size, $d_1$, is a hyper-parameter. The size of $v_k^i$ is also a hyper-parameter, say $d_2$, not necessarily equal to $d_1$. In the paper's implementation, $d_2 = d_1$.

$W^Q$ and $W^K$ are both $d_1 \times d$ matrices, while $W^V$ is a $d_2 \times d$ matrix.


Transformer – Self attention

The interrelationship between firm $i$'s characteristics at different times in the historical period, say $k, k' \in \{1, \dots, K\}$, is modeled by the scaled dot product of the query $q_k^i$ and the key $k_{k'}^i$:

$$\beta_{k,k'} = \frac{(q_k^i)'\, k_{k'}^i}{\sqrt{d_1}}$$

This interrelationship is the attention weight.

The matrices $W^Q$ and $W^K$ are in general different, which allows the interrelationship to be non-symmetric. $\sqrt{d_1}$ is a scale factor, where $d_1$ is the length of the two vectors.

The attention score for time $k$ is a softmax of the interrelationships with the other times $k'$, multiplied by their values:

$$a_k^i = \frac{\sum_{k'=1}^{K} \exp\left( \beta_{k,k'} \right) v_{k'}^i}{\sum_{k'=1}^{K} \exp\left( \beta_{k,k'} \right)}$$

where $K$ is the number of periods in the input sequence, and $a_k^i$ is the attention score, a vector of length $d_2$.


Transformer – Self attention

A reminder: softmax is defined as follows:

$$\mathrm{softmax}\left( z_i \right) = \frac{\exp\left( z_i \right)}{\sum_{i'=1}^{K} \exp\left( z_{i'} \right)}, \qquad i = 1, \dots, K$$

The output of the self-attention process can be written in matrix notation as follows:

$$Z = \mathrm{softmax}\!\left( \frac{X' (W^Q)'\, W^K X}{\sqrt{d_1}} \right) X' (W^V)'$$

$Z$ is a $K \times d_2$ matrix.

$Z$ contains the hidden states or, put differently, the encoded firm characteristics resulting from a single self-attention unit.
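The matrix form translates almost line by line into code. Below is a numpy sketch of a single self-attention unit (a hedged illustration, not the papers' implementation); rows of `X` are the characteristic vectors $x_k^i$.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over one firm's sequence (a sketch).

    X: (K, d) sequence; W_Q, W_K: (d1, d); W_V: (d2, d).
    Returns Z: (K, d2), the encoded hidden states.
    """
    Q = X @ W_Q.T                            # (K, d1) queries
    Km = X @ W_K.T                           # (K, d1) keys
    V = X @ W_V.T                            # (K, d2) values
    beta = Q @ Km.T / np.sqrt(W_Q.shape[0])  # (K, K) attention weights
    return softmax(beta, axis=1) @ V         # (K, d2) attention scores

# toy usage: K = 12 months, d = 51 characteristics, d1 = d2 = 51
rng = np.random.default_rng(0)
K, d, d1, d2 = 12, 51, 51, 51
X = rng.standard_normal((K, d))
W_Q, W_K, W_V = (rng.standard_normal((d1, d)), rng.standard_normal((d1, d)),
                 rng.standard_normal((d2, d)))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (12, 51)
```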


Transformer – Multi-Head Attention

In order to capture a number of complex interrelations in a sequence, a group of self-attention units is connected in parallel.

This connected group is termed multi-head attention.

The multi-head attention unit is a group of four¹ self-attention units, each with different trainable weight matrices $W^Q$, $W^K$, $W^V$. Of course, the number four is a hyper-parameter. Each self-attention unit has an output $Z_h$, $h = 1, \dots, 4$, which is a $K \times d_2$ matrix. All those matrices are concatenated into a new matrix

$$\tilde{Z} = \left[ Z_1, \dots, Z_4 \right]$$

of dimension $K \times 4 d_2$.

Then $\tilde{Z}$ is multiplied from the right by a weight matrix $W_o$ of dimension $4 d_2 \times d$.

We will soon see why the number of columns of $W_o$ is $d$.

The output of the multi-head attention is $K \times d$.

¹ In the original setup there are 8 self-attention units in a multi-head attention unit; see “Attention Is All You Need”, Vaswani et al., arXiv:1706.03762.
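Building on the `self_attention` sketch above, a multi-head unit simply runs several heads in parallel, concatenates their outputs, and projects back to width $d$ (again a hedged sketch with illustrative names):

```python
def multi_head_attention(X, heads, W_o):
    """Multi-head attention (a sketch reusing self_attention from above).

    heads: list of (W_Q, W_K, W_V) tuples, one per self-attention unit;
    W_o: (4*d2, d) projection so the output matches the input width d.
    """
    Z_tilde = np.concatenate([self_attention(X, *h) for h in heads], axis=1)
    return Z_tilde @ W_o                          # (K, d)

# toy usage with 4 heads on the (12, 51) input from the previous sketch
heads = [(rng.standard_normal((d1, d)), rng.standard_normal((d1, d)),
          rng.standard_normal((d2, d))) for _ in range(4)]
W_o = rng.standard_normal((4 * d2, d))
print(multi_head_attention(X, heads, W_o).shape)  # (12, 51)
```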


Transformer Encoder – Summary

A summary of the transformer encoder architecture is illustrated in the figure¹ on this slide.

Another ingredient in the architecture is the residual connection, the block marked “Add & Norm”. It sums the input of the previous sub-layer (multi-head attention or FFN) with its output and normalizes the sum:

$$\mathrm{LayerNorm}\left( \tilde{Z} + X \right)$$

LayerNorm means subtracting the mean and dividing by the standard deviation.

Since $\tilde{Z}$ and $X$ must have the same dimension, $W_o$ has $d$ columns.

The FFN is a fully connected network with a single hidden layer of 1024 neurons: it has a weight matrix of dimension $1024 \times 51$, a bias vector of length 1024, another weight matrix of dimension $51 \times 1024$, and a bias vector of length 51.

This sums up the time-series encoding. For each firm $i$ we get the hidden states $Z^i = \left( z_1^i, \dots, z_k^i, \dots, z_K^i \right)$, which form a matrix of dimension $12 \times 51$.

¹ Source: “Attention Is All You Need”, A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, arXiv:1706.03762
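Putting the four operations together, one encoder block can be sketched as follows (reusing the sketches above; the ReLU inside the FFN is an assumption borrowed from the original Transformer, as the slide does not specify the activation):

```python
def layer_norm(A):
    """Subtract the mean and divide by the standard deviation, row by row."""
    return (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)

def encoder_block(X, heads, W_o, W1, b1, W2, b2):
    """Multi-head attention -> Add & Norm -> FFN -> Add & Norm (a sketch)."""
    Z1 = layer_norm(multi_head_attention(X, heads, W_o) + X)  # residual + norm
    ffn = np.maximum(Z1 @ W1.T + b1, 0.0) @ W2.T + b2         # FFN, ReLU assumed
    return layer_norm(ffn + Z1)                               # residual + norm

# toy usage: FFN with 1024 hidden neurons, as on the slide
W1, b1 = rng.standard_normal((1024, d)), np.zeros(1024)
W2, b2 = rng.standard_normal((d, 1024)), np.zeros(d)
print(encoder_block(X, heads, W_o, W1, b1, W2, b2).shape)     # (12, 51)
```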


Cross asset attention network (CAAN)

Up to this stage we have transformed the real observations $X^i$ into hidden states $Z^i$. An alternative way to make that transformation is to use LSTM, where the $Z^i$ are replaced by the cell states.

The cross-sectional interrelations among firms' hidden states are modeled by a self-attention module called the cross asset attention network (CAAN). This is the same unit as the one we saw in the multi-head attention.

All the hidden-state vectors for firm $i$, $Z^i = \left( z_1^i, \dots, z_k^i, \dots, z_K^i \right)$, are concatenated into a vector

$$y^i = \mathrm{Concat}\left( z_1^i, \dots, z_k^i, \dots, z_K^i \right)$$

$y^i$ is a vector of length $d \cdot K$ that represents firm $i$.

Concat is similar to the vec operator.

The representation $y^i$ is used to construct three other representation vectors: a query $q^i$, a key $k^i$, and a value $v^i$. Those vectors are constructed using the trainable weight matrices $W^Q_{\mathrm{CAAN}}$, $W^K_{\mathrm{CAAN}}$, $W^V_{\mathrm{CAAN}}$, learnt during the training process:

$$q^i = W^Q_{\mathrm{CAAN}}\, y^i, \qquad k^i = W^K_{\mathrm{CAAN}}\, y^i, \qquad v^i = W^V_{\mathrm{CAAN}}\, y^i$$

The matrices for the query, key, and value are different from those in the encoder stage; therefore we add the subscript CAAN.

$W^Q_{\mathrm{CAAN}}$ and $W^K_{\mathrm{CAAN}}$ are $d_3 \times dK$ matrices and $W^V_{\mathrm{CAAN}}$ is a $d_4 \times dK$ matrix, where $d_3 = d_4 = 256$.


Cross asset attention network (CAAN)

The interrelationship (sometimes also called the attention weight) of stock $j$ and stock $i$ is modeled by the dot product of stock $i$'s query, $q^i$, and stock $j$'s key, $k^j$:

$$\beta_{i,j} = \frac{(q^i)'\, k^j}{\sqrt{d_3}}$$

The attention score for stock $i$ is a softmax of the interrelationships of stocks $i$ and $j$, multiplied by stock $j$'s value:

$$a^i = \frac{\sum_{j=1}^{I} \exp\left( \beta_{i,j} \right) v^j}{\sum_{j=1}^{I} \exp\left( \beta_{i,j} \right)}$$

where $I$ is the overall number of stocks and $a^i$ is a vector of length $d_4$ ($= d_3$ here).

Finally, the winner score determining the long-short position of stock $i$ in the portfolio is

$$s^i = \mathrm{sigmoid}\left( W^S a^i + b^S \right)$$

where $W^S$ is a $1 \times d_4$ weight matrix and $b^S$ is a scalar bias, both learnt during training.
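A numpy sketch of the CAAN stage, reusing the `softmax` helper from the self-attention sketch (names and shapes are illustrative assumptions):

```python
def caan_scores(Y, W_Q, W_K, W_V, W_S, b_S):
    """Cross-asset attention and winner scores (a sketch).

    Y: (I, d*K) concatenated hidden states, one row per stock;
    W_Q, W_K: (d3, d*K); W_V: (d4, d*K); W_S: (d4,); b_S: scalar.
    Returns s: (I,) winner scores in (0, 1).
    """
    Q, Km, V = Y @ W_Q.T, Y @ W_K.T, Y @ W_V.T     # (I,d3), (I,d3), (I,d4)
    beta = Q @ Km.T / np.sqrt(W_Q.shape[0])        # (I, I) cross-stock weights
    A = softmax(beta, axis=1) @ V                  # (I, d4) attention scores
    return 1.0 / (1.0 + np.exp(-(A @ W_S + b_S))) # sigmoid winner scores

# toy usage: I = 100 stocks, d*K = 51*12 = 612, d3 = d4 = 256
I_n, dK, d3, d4 = 100, 612, 256, 256
Y = rng.standard_normal((I_n, dK))
s = caan_scores(Y, rng.standard_normal((d3, dK)), rng.standard_normal((d3, dK)),
                rng.standard_normal((d4, dK)), rng.standard_normal(d4), 0.0)
print(s.shape)  # (100,)
```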


Reinforcement Learning Optimization

Given the scores of the stocks, $s^{(1)}, \dots, s^{(i)}, \dots, s^{(I)}$, the long and short portfolios are constructed from the $G$ most extreme scores.

Let $o^{(i)}$ be the rank of stock $i$'s score, in descending order.

Stock $i$ is in the long portfolio $b^+$ if $o^{(i)} \in [1, G]$:

$$b^+(i) = \frac{\exp\left( s^{(i)} \right)}{\sum_{o^{(i')} \in [1, G]} \exp\left( s^{(i')} \right)}$$

Stock $i$ is in the short portfolio $b^-$ if $o^{(i)} \in [I - G + 1, I]$:

$$b^-(i) = \frac{\exp\left( -s^{(i)} \right)}{\sum_{o^{(i')} \in [I - G + 1, I]} \exp\left( -s^{(i')} \right)}$$

Let us denote by $b^c$ the vector of weights for all $I$ stocks.
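A sketch of the mapping from winner scores to portfolio weights; here the short-leg weights are stored with a negative sign so that $b^c$ is directly a long-short weight vector (a design choice for the illustration, reusing numpy and the scores `s` from the CAAN sketch):

```python
def long_short_weights(s, G):
    """Softmax weights within the long and short legs (a sketch).

    Long the G highest-score stocks, short the G lowest. Returns b_c: (I,)
    with the long leg summing to +1 and the short leg to -1.
    """
    order = np.argsort(-s)                    # stocks in descending score order
    b_c = np.zeros_like(s)
    long_idx, short_idx = order[:G], order[-G:]
    b_c[long_idx] = np.exp(s[long_idx]) / np.exp(s[long_idx]).sum()
    b_c[short_idx] = -np.exp(-s[short_idx]) / np.exp(-s[short_idx]).sum()
    return b_c

b_c = long_short_weights(s, G=10)              # scores from the CAAN sketch
print(b_c[b_c > 0].sum(), b_c[b_c < 0].sum())  # 1.0 and -1.0
```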


Reinforcement Learning

“Learning what to do – how to map situations to actions – so as to maximize a numerical reward signal” (Sutton and Barto, 2008)

The characteristics $X_t^i = \left( x_1^i, \dots, x_k^i, \dots, x_K^i \right)$ over the past $K$ periods, for $i = 1, \dots, I$, are the state of the environment.

The action at time $t$ is the vector of stock weights in the constructed portfolio, $b_t^c$.

The reward at time $t$ is the portfolio return at $t+1$: $r_{t+1}^p = r_{t+1}'\, b_t^c$.

The value is the Sharpe ratio of a sequence of realized returns, say 12 months:

$$J = SR\left( r_1^p, r_2^p, \dots, r_{12}^p \right)$$
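The value function is then just the Sharpe ratio of the realized portfolio returns (a sketch reusing numpy and the rng from the earlier sketches; excess-return and annualization conventions are left out as assumptions):

```python
def sharpe_value(portfolio_returns):
    """Value J: Sharpe ratio of a sequence of realized portfolio returns."""
    r = np.asarray(portfolio_returns)
    return r.mean() / r.std()

# toy usage: 12 months of hypothetical portfolio returns
print(sharpe_value(0.01 * rng.standard_normal(12) + 0.005))
```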


Reinforcement Learning

Let us denote by $\theta$ the model parameters that affect $b_t^c$; then the optimization is to find $\theta^*$ such that

$$\theta^* = \arg\max_{\theta}\, J(\theta)$$

To summarize, the trainable parameters are

$$\theta = \{ \underbrace{W^Q_{\mathrm{CAAN}}, W^K_{\mathrm{CAAN}}, W^V_{\mathrm{CAAN}}, W^S, b^S}_{\text{CAAN parameters}},\; \underbrace{4 \times \left( W^Q, W^K, W^V \right), W_o}_{\text{TE parameters}} \}$$

The collection of hyper-parameters, denoted by $\{ d, d_1, d_2, d_3, d_4, h, K, G \}$, is determined in the test sample.

The investor is a price taker; hence, his action does not affect the evolution of the environment.

Other reinforcement learning techniques and applications in economics, game theory, operational research, and finance can be found in Charpentier, Elie, and Remlinger (2020).


Sparse versus non-sparse representations

This research controversy crosses disciplines that deal with model uncertainty and variable selection.

Zou and Hastie (2005) advocate in favor of sparse representation.

Fu (1998) advocates using generalized cross-validation to select the shrinkage parameter ($q$) and the tuning parameter ($\lambda$) in a bridge regression setup and shows that bridge regression performs well. The bridge regression could produce non-sparse solutions.

In asset pricing, Freyberger, Neuhierl, and Weber (2017) use the adaptive group LASSO to select characteristics and to estimate how they affect expected returns non-parametrically.

This study adopts Lasso-style estimation with an $\ell_1$-penalty, and the results suggest a relatively high degree of redundancy among equity predictors.


Sparse versus non-sparse representations

In contrast, Kozak, Nagel, and Santosh (2017b) advocate against sparse solutions.

Informed Bayes priors on all model coefficients suggest non-sparsity.

Notice that factor models imply a sparse solution, yet a caveat is in order.

Take the present-value relation – it can indeed motivate why book-to-market and expected profitability could jointly explain expected returns.

However, the notion that expected profitability is unobserved gives license to fish among a large number of observable firm characteristics that predict future returns through their ability to predict future profitability.


From Chaos to Zoo…

The remaining pages of these notes present the paper “Machine learning versus economic restrictions: evidence from stock return predictability.”

To start, John Cochrane points out in his presidential address (AFA 2011) that in the beginning there was chaos.

Practitioners thought that one only needed to be clever to earn high returns, but then came the CAPM.

Every clever strategy that delivered high average return ended up delivering high market beta.

Then anomalies (size, value, momentum, profitability, investment) erupted, and there was chaos again.

To address the “zoo” of variables, Cochrane suggests, we should consider different methods, namely, beyond cross-section regressions and portfolio sorts.


Zoo of Anomalies

[Figure omitted] Source: Harvey, Liu, and Zhu (2016)


Challenging the multi-dimensional challenge

Harvey, Liu, and Zhu (2016): 296 anomalies; 27% to 53% are likely to be false discoveries

Hou, Xue, and Zhang (2018): 452 anomalies; 82% turn insignificant upon excluding microcaps and value-weighting

Anomaly profits are mostly attributable to the short leg of the trade (Stambaugh, Yu, and Yuan, 2012)

Anomalies tend to concentrate in distressed stocks (Avramov, Chordia, Jostova, and Philipov, 2013 and 2018)

Anomalies attenuate, and often disappear, in recent years (Chordia, Subrahmanyam, and Tong, 2014)


Machine learning in asset pricing

Counter to this ‘anomaly-challenging’ strand of literature, there has been a fast-growing body of work that reports outstanding investment profitability based on signals from machine learning methods.

Reduced-form predictability: Freyberger, Neuhierl, and Weber (2018); Feng, Polson, and Xu (2019); Gu, Kelly, and Xiu (2019a)

Predictability with asset pricing restrictions: Lettau and Pelger (2018); Chen, Pelger, and Zhu (2019); Feng, Giglio, and Xiu (2019); Gu, Kelly, and Xiu (2019b); Kelly, Pruitt, and Su (2019); Kozak, Nagel, and Santosh (2020)


Machine learning – motivation

So, papers on machine learning methods have been motivated by Cochrane's call for new methods.

What needs to be done?

Address the high dimension of noisy and correlated predictors

Utilize flexible, possibly non-linear, functional forms

Implement model selection

Mitigate overfitting biases through regularization

Machine learning: automated detection of complex patterns in data; combining multiple, possibly weak, sources of information into a meaningful composite signal


Research Questions

Do ML methods clear standard economic restrictions in asset pricing?

Cross section: Exclude difficult-to-value and difficult-to-arbitrage stocks (microcaps or distressed stocks).

Time series: Is ML-based predictability manifested through market states associated with alleviated limits to arbitrage (e.g., high VIX, high illiquidity)?

Assess turnover and the potential impact of trading costs.

Examine whether the tangency portfolio based on estimating the SDF through machine learning is admissible.

Assess the economic grounds for the seemingly opaque ML methods.

We primarily focus on the four machine learning methods and also consider the shrinkage approach advocated by Kozak, Nagel, and Santosh (2020).


Machine Learning Method I: GKX

Neural network with 3 hidden layers (NN3) (Gu, Kelly, and Xiu 2019a)

Example: 1 hidden layer with 5 neurons

(4 + 1) × 5 + 6 = 31 parameters: with 4 inputs, each of the 5 hidden neurons has 4 weights plus a bias, and the output unit has 5 weights plus a bias.


Machine Learning Method I: GKX

Neural network with 3 hidden layers (NN3) (Gu, Kelly, and Xiu 2019a)

32, 16, and 8 neurons per layer

Reduced form, no economic restriction

94 firm characteristics + 74 industry dummies + 8 macroeconomic predictors + interactions

(8 + 1) × 94 + 74 = 920 predictors

Training sample: 18 years, 1957 to 1974

Validation sample: 12 years, 1975 to 1986

Out-of-sample test: 31 years, 1987 to 2017


Machine Learning Method II: CPZ

Multiple connected neural networks (Chen, Pelger, and Zhu 2019)

Incorporate no-arbitrage conditions to estimate the SDF and stock risk loadings


Machine Learning Method II: CPZ

Multiple connected neural networks (Chen, Pelger, and Zhu 2019)

46 firm characteristics + 178 macroeconomic predictors + interactions

Training sample: 20 years, 1967 to 1986

Validation sample: 5 years, 1987 to 1991

Out-of-sample test: 25 years, 1992 to 2016


Machine Learning Method III: IPCA

Instrumented principal component analysis (IPCA) (Kelly, Pruitt, and Su 2019)

$$r_{i,t+1} = \alpha_{i,t} + \beta'_{i,t} f_{t+1} + \epsilon_{i,t+1}$$
$$\beta'_{i,t} = z'_{i,t} \Gamma_{\beta} + \upsilon'_{\beta,i,t}$$

Factor loadings vary with predictive characteristics linearly

Incorporate the zero-alpha restriction

94 firm characteristics

Estimated in each month

Out-of-sample test: 31 years, 1987 to 2017


Machine Learning Method IV: CA

Conditional Autoencoder (CA) (Gu, Kelly, and Xiu 2019b)

Example: 1 hidden layer with 3 neurons (factors)


Machine Learning Method IV: CA

Conditional Autoencoder (CA) (Gu, Kelly, and Xiu 2019b)

32 and 16 neurons per layer, five latent factors

Factor loadings vary with predictive characteristics non-linearly through neural networks

Incorporate the zero-alpha restriction

94 firm characteristics

Training sample: 18 years, 1957 to 1974

Validation sample: 12 years, 1975 to 1986

Out-of-sample test: 30 years, 1987 to 2016



Summary of Machine Learning Methods


Data

CRSP: daily and monthly stock data

COMPUSTAT: quarterly and annual financial statement data

GKX: all NYSE/AMEX/Nasdaq stocks; missing characteristics are set to the cross-sectional median

21,882 stocks, between 5,117 and 7,877 per month

CPZ: all U.S. stocks from CRSP with available data on firm characteristics

7,904 stocks, between 1,933 and 2,755 per month


Economic Restrictions

Cross-sectional return predictability is concentrated in microcaps and distressed firms

Microcaps: market cap smaller than the 20th NYSE size percentile

Rated firms: firms with data on S&P long-term issuer credit rating

Distressed firms: lower-rated firms that further undergo deteriorating credit conditions



Sub-samples with Economic Restrictions: GKX



Sub-samples with Economic Restrictions: CPZ



GKX Portfolio Return Spread: EW

• Long (short) stocks with the highest (lowest) NN3-predicted returns; monthly rebalanced decile portfolios



GKX Portfolio Return Spread: EW vs. VW

• VW performance is 47% lower than EW.


GKX Portfolio Return Spread: EW vs. VW


• Exclude microcaps: 48% lower than the full sample; rated firms: 46% ↓; exclude downgrades: 70% ↓


CPZ Portfolio Return Spread: EW vs. VW


• Long (short) stocks with the highest (lowest) risk loadings on the SDF; monthly rebalanced decile portfolios

• VW performance is 43% lower than EW.


CPZ Portfolio Return Spread: Economic Restrictions


• Exclude microcaps: 62% lower than the full sample; rated firms: 72% ↓; exclude downgrades: 64% ↓


IPCA Portfolio Return Spread: IPCA vs. GKX vs. CPZ

• IPCA underperforms GKX and CPZ in the full sample.


IPCA Portfolio Return Spread: Economic Restrictions

• IPCA: no material deterioration of performance in subsamples with economic restrictions


CA Portfolio Return Spread: Economic Restrictions

• Exclude microcaps: 48% lower than the full sample; rated firms: 75% ↓; exclude downgrades: 94% ↓


Characteristics of VW ML Portfolios


ML methods: smaller maximum drawdown than the market; positive return during the crisis period

                              Sharpe Ratio  Skewness  Excess Kurtosis  Maximum Drawdown  Return in Crisis  Turnover

Panel A: Sorted by NN3-Predicted Return
Full Sample                   0.944         0.631     5.222            0.350             4.100             0.976
Non-Microcaps                 0.644         0.361     7.062            0.349             3.563             0.869

Panel B: Sorted by Risk Loading
Full Sample                   1.225         1.063     5.932            0.209             0.472             1.664
Non-Microcaps                 0.839         0.326     1.582            0.246             0.677             1.625

Panel C: Sorted by IPCA-Predicted Return
Full Sample                   0.967         -0.449    4.805            0.203             0.574             1.186
Non-Microcaps                 0.978         -0.267    5.369            0.234             1.493             1.130

Panel D: Sorted by CA2-Predicted Return
Full Sample                   0.784         -0.077    2.418            0.202             -0.047            1.565
Non-Microcaps                 0.748         0.291     4.684            0.207             -0.529            1.478

Panel E: Market Portfolio
Full Sample                   0.527         -0.978    3.323            0.486             -6.954            0.089
Non-Microcaps                 0.530         -0.959    3.222            0.485             -6.907            0.086


Characteristics of VW ML Portfolios


ML methods require high turnover in portfolio rebalancing → trading costs reduce the gross returns based on the GKX (CPZ, IPCA, CA) method by at least 0.43% (0.81%, 0.56%, 0.74%); see the table above.



GKX Portfolio Return Spreads: Exclude Microcaps + VW

ML methods require high turnover in portfolio rebalancing → trading costs reduce the gross returns based on the GKX method by at least 0.43%


CPZ Portfolio Return Spreads: Exclude Microcaps + VW

ML methods require high turnover in portfolio rebalancing → trading costs reduce the gross returns based on the CPZ method by at least 0.81%


IPCA Portfolio Return Spreads: Exclude Microcaps + VW

ML methods require high turnover in portfolio rebalancing → trading costs reduce the gross returns based on the IPCA method by at least 0.56%


CA Portfolio Return Spreads: Exclude Microcaps + VW

ML methods require high turnover in portfolio rebalancing → trading costs reduce the gross returns based on the CA method by at least 0.74%


An alternative ML Method

CPZ: estimate the SDF for individual stocks

Kozak, Nagel, and Santosh (2020) (KNS): estimate the SDF for equity portfolios, i.e., long-short portfolio returns based on predictive characteristics

Minimize the Hansen-Jagannathan (1991) distance

Ridge regression with three-fold cross-validation

Apply the 94 characteristics in GKX

In-sample estimation: 1964 to 2004

Out-of-sample test: 2005 to 2017


Characteristics of SDF-Implied MVE Portfolios

                        CAPM       FF6       Sharpe    SDF-Implied MVE Portfolio Weights
                                             Ratio     Mean     10%      25%      Median   75%     90%
Full Sample             3.662***   3.338***  2.318     0.083    -1.994   -0.912   0.341    0.964   1.687
                        (6.01)     (5.90)
Non-Microcaps           1.543***   0.895***  0.977     0.084    -0.592   -0.238   0.072    0.407   0.647
                        (3.88)     (2.87)
Credit Rating Sample    1.418***   0.717*    0.898     -0.006   -0.382   -0.137   -0.003   0.187   0.326
                        (2.97)     (1.93)
Non-Downgrades          1.308***   0.545     0.828     -0.022   -0.370   -0.217   0.004    0.135   0.293
                        (2.92)     (1.59)

• Imposing economic restrictions reduces performance, as well as the odds of extreme positions

• Deep learning techniques face the usual challenge of cross-sectional predictability: profitability evolves from difficult-to-arbitrage stocks + sizable trading costs

• ... and we have not yet addressed the time-series channel


Time-Varying Return Predictability: GKX FF6 Alpha

Binding limits to arbitrage → more profitable anomaly-based trading strategies

High sentiment, high volatility, and low liquidity


Time-Varying Return Predictability: GKX and CPZ

$$HML_t = \alpha_0 + \beta_1\, High\,SENT_{t-1} + \beta_2\, High\,MKTVOL_{t-1} + \beta_3\, High\,MKTILLIQ_{t-1} + \beta_4 M_{t-1} + c' F_t + e_t$$


Time-Varying Return Predictability: IPCA and CA

$$HML_t = \alpha_0 + \beta_1\, High\,SENT_{t-1} + \beta_2\, High\,MKTVOL_{t-1} + \beta_3\, High\,MKTILLIQ_{t-1} + \beta_4 M_{t-1} + c' F_t + e_t$$

                 Sorted by IPCA-Predicted Return       Sorted by CA2-Predicted Return
                 Model 1  Model 2  Model 3  Model 4    Model 5  Model 6  Model 7  Model 8
Constant         0.501    0.286    0.175    0.083      0.361    0.246    -0.825   -0.743
                 (1.55)   (0.88)   (0.35)   (0.15)     (0.69)   (0.47)   (-0.81)  (-0.67)
High SENT        0.572*   0.758**  0.158    0.277      0.698    0.708    0.465    0.489
                 (1.74)   (2.17)   (0.49)   (0.82)     (1.27)   (1.26)   (1.04)   (1.01)
High MKTVOL      0.069             0.143               -0.080            0.076
                 (0.22)            (0.52)              (-0.15)           (0.13)
High VIX                  0.447             0.303               0.228             0.189
                          (1.27)            (0.83)              (0.40)            (0.33)
High MKTILLIQ    0.246    0.245    0.130    0.258      0.930    1.196**  0.491    0.822
                 (0.70)   (0.65)   (0.41)   (0.70)     (1.65)   (2.00)   (0.79)   (1.24)
Controls         N        N        Y        Y          N        N        Y        Y

(The assignment of MKTVOL to odd-numbered models and VIX to even-numbered models is inferred from the row order and the number of reported coefficients.)

• IPCA and CA: trading profits do not significantly vary with market states.


Return Predictability in Recent Years: GKX Full Sample

• Unlike individual anomalies, the GKX strategy remains viable in recent years.


Return Predictability in Recent Years: GKX Non-Microcaps

• Unlike individual anomalies, the GKX strategy remains viable in recent years.


Return Predictability in Recent Years: Non-Microcaps

• Most ML strategies remain viable in recent years.


Stock Characteristics of ML Portfolios

All ML methods identify stocks in line with most anomaly-based trading strategies.

Long positions: small, value, illiquid, and old stocks with low price, low beta, high 11-month return, low asset growth, low equity issuance, low credit rating coverage, and low analyst coverage.

Robust to sub-periods and market states


Stock Characteristics of ML Portfolios

Exceptions: long stocks with high corporate investment and high idiosyncratic volatility

Negative investment-return relation among firms with higher cash flows and lower debt ratios

Negative (positive) idiosyncratic volatility-return relation among overpriced (underpriced) stocks


Intra-Industry vs. Inter-Industry Return Predictability

• The intra-industry strategy accounts for the majority of the unconditional profit → informs on stock selection, not industry rotation


Intra-Industry vs. Inter-Industry Return Predictability

• The intra-industry strategy improves performance, especially on non-microcaps.


Robustness Test: Exclude Microcaps in NN3

• NN3: full-sample estimation

• NN3-EX: exclude microcaps in estimation


Robustness Test: Value-Weighted Loss Function

• NN3: equal-weighted loss function, predict return

• NN3-VW: value-weighted loss function, predict FF6 alpha


Potential of ML in Asset Management

Mitigate downside risk and hedge against crises

Remain profitable in recent years

Profitable in long positions: e.g., GKX signal, exclude microcaps + VW


Conclusion

Investments based on deep learning signals extract profitability primarily from difficult-to-arbitrage stocks and during high-volatility and high-illiquidity market states.

Despite their opaque nature, ML methods identify mispriced stocks consistent with most anomalies.

Beyond economic restrictions, ML signals are profitable in long positions, remain viable in recent years, and command low downside risk.


References


Avramov, D., T. Chordia, G. Jostova, and A. Philipov, 2013, Anomalies and financial distress, Journal of Financial Economics 108, 139–159.

Belloni, A., V. Chernozhukov, and C. Hansen, 2014, Inference on treatment effects after selection among high-dimensional controls, The Review of Economic Studies 81, 608–650.

Borisenko, D., 2019, Dissecting momentum: We need to go deeper, Working paper.

Charpentier, A., R. Elie, and C. Remlinger, 2020, Reinforcement learning in economics and finance, arXiv:2003.10014v1.

Chen, L., M. Pelger, and J. Zhu, 2019, Deep learning in asset pricing, Working paper.

Chordia, T., A. Subrahmanyam, and Q. Tong, 2014, Have capital market anomalies attenuated in the recent era of high liquidity and trading activity?, Journal of Accounting and Economics 58, 51–58.

Cochrane, J. H., 2011, Presidential address: Discount rates, The Journal of Finance 66, 1047–1108.

Cong, L. W., K. Tang, J. Wang, and Y. Zhang, 2020, AlphaPortfolio for investment and economically interpretable AI, Working paper.

Feng, G., S. Giglio, and D. Xiu, 2017, Taming the factor zoo, Working paper.

Feng, G., S. Giglio, and D. Xiu, 2019, Taming the factor zoo, Forthcoming, Journal of Finance.

Frank, I. E., and J. H. Friedman, 1993, A statistical view of some chemometrics regression tools, Technometrics 35, 109–135.

Freyberger, J., A. Neuhierl, and M. Weber, 2017, Dissecting characteristics nonparametrically, Working paper.

Freyberger, J., A. Neuhierl, and M. Weber, 2018, Dissecting characteristics nonparametrically, Working paper.

Fu, W. J., 1998, Penalized regressions: The bridge versus the lasso, Journal of Computational and Graphical Statistics 7, 397–416.

Gu, S., B. Kelly, and D. Xiu, 2019, Autoencoder asset pricing models, Forthcoming, Journal of Econometrics.

Gu, S., B. Kelly, and D. Xiu, 2019, Empirical asset pricing via machine learning, Working paper.

Harvey, C. R., Y. Liu, and H. Zhu, 2016, ...and the cross-section of expected returns, Review of Financial Studies 29, 5–68.

Hoerl, A. E., and R. W. Kennard, 1970, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12, 55–67.

Hoerl, A. E., and R. W. Kennard, 1970, Ridge regression: Applications to nonorthogonal problems, Technometrics 12, 69–82.

Hou, K., C. Xue, and L. Zhang, 2018, Replicating anomalies, Forthcoming, Review of Financial Studies.

Huang, J., J. L. Horowitz, and F. Wei, 2010, Variable selection in nonparametric additive models, Annals of Statistics 38, 2282–2313.

Kelly, B., S. Pruitt, and Y. Su, 2017, Instrumented principal component analysis, Working papers, Chicago Booth and ASU WP Carey.

Kelly, B., S. Pruitt, and Y. Su, 2019, Characteristics are covariances: A unified model of risk and return, Forthcoming, Journal of Financial Economics.

Kozak, S., S. Nagel, and S. Santosh, 2018, Interpreting factor models, The Journal of Finance 73, 1183–1223.

Kozak, S., S. Nagel, and S. Santosh, 2018, Shrinking the cross section, NBER Working paper.

Kozak, S., S. Nagel, and S. Santosh, 2019, Shrinking the cross-section, Forthcoming, Journal of Financial Economics.

Lettau, M., and M. Pelger, 2018a, Estimating latent asset-pricing factors, Forthcoming, Journal of Econometrics.

Lettau, M., and M. Pelger, 2018b, Factors that fit the time series and cross-section of stock returns, Working paper.

McLean, R., and J. Pontiff, 2016, Does academic research destroy stock return predictability?, Journal of Finance 71, 5–31.

Nair, V., and G. E. Hinton, 2010, Rectified linear units improve restricted Boltzmann machines, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814.

Pástor, Ľ., 2000, Portfolio selection and asset pricing models, The Journal of Finance 55, 179–223.

Pástor, Ľ., and R. F. Stambaugh, 2000, Comparing asset pricing models: An investment perspective, Journal of Financial Economics 56, 335–381.

Stambaugh, R. F., J. Yu, and Y. Yuan, 2012, The short of it: Investor sentiment and anomalies, Journal of Financial Economics 104, 288–302.

Wang, J., Y. Zhang, K. Tang, J. Wu, and Z. Xiong, 2019, AlphaStock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1900–1908.

Zou, H., and T. Hastie, 2005, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.