Estimating Covariance Using Factorial Hidden Markov Models

Joao Sedoc (1,2)

with: Jordan Rodu (3), Lyle Ungar (1), Dean Foster (1) and Jean Gallier (1)

(1) University of Pennsylvania, Philadelphia, PA

(2) Chivalric Trading

(3) Carnegie Mellon University, Pittsburgh, PA

PGMO Conference, October 29, 2014

Joao Sedoc Estimating Covariance Using Factorial Hidden Markov Models 1 / 42

Outline

1 Motivation
    What’s Novel?
    Portfolio Optimization
    Non-Stationary Covariance

2 Introduction to Factorial HMMs
    HMM Application to Problems
    Quick Overview of Hidden Markov Models
    Estimation
    Factorial HMM

3 Empirical Results

4 Conclusion

What’s Novel: Innovations to Factorial HMMs

Multiple time horizon HMM using a structured approach

Incorporation of high frequency data

Estimation in near real time

Continuous emission HMM

Provable bounds

Incorporation of exogenous data

Provable bounds

What’s Novel: Application to Portfolio Optimization

Markowitz optimization is a well-known theory, but it is hard to do right

The allocation is optimized under exponential utility

$$\operatorname*{argmax}_{\alpha_{\mathrm{pos}}} \; P^{\top}\alpha_{\mathrm{pos}} - \frac{1}{2}\,\zeta\,\alpha_{\mathrm{pos}}^{\top}\,\Sigma\,\alpha_{\mathrm{pos}}$$

where $\alpha_{\mathrm{pos}}$ is the notional allocation, $p_t$ is the asset price at time $t$, $P_t = E[p_{t+\tau} \mid p_t]$ is the expected profit, $\Sigma$ is the asset return covariance matrix, and $\zeta$ is the risk-aversion free variable.

In this talk we focus only on improving covariance estimation

We want a better estimate of $\Sigma \to \Sigma_t$

Drawbacks of Current Models

Modern approaches are constrained by computational complexity

Trade off between model richness and data richness

Difficult to both explain and identify the model

Incorporation of exogenous data is often difficult in empirical models

S&P 500 realized variance

Figure: S&P 500 variance (second resolution)

S&P 500 and 30 Year Treasury realized covariance

Figure: S&P 500 and 30 Year Treasury covariance (second resolution)

Common Applications of Hidden Markov Models

Gene recognition

Robotics

Natural language processing tasks

Speech Recognition

Hidden Markov Models

There are two primary assumptions for this basic HMM:

1 The underlying hidden state process is Markovian

2 Given the hidden states, the observations are independent

Figure: HMM with states $h_t$, $h_{t+1}$, and $h_{t+2}$ that emit observations $x_t$, $x_{t+1}$, and $x_{t+2}$ respectively.

The probability distribution over the next hidden state at time $t+1$ depends only on the current hidden state at time $t$:

$$\Pr(h_{t+1} \mid h_t, \ldots, h_1) = \Pr(h_{t+1} \mid h_t).$$

The Hidden Markov Model parameters

Transition matrix $T$, whose column $i$ is $\Pr(h_{t+1} \mid h_t = i)$

Collection of $\lambda(x)$'s, the emission distributions $\Pr(x_{t+1} \mid h_{t+1})$

Initial state distribution $\pi = \Pr(h_1)$

Figure: Pictorial view of HMM parameters

The likelihood of a sequence of observations from a specified model is

$$\Pr(x_1, \ldots, x_t) = \sum_{h_1, \ldots, h_t} [\pi]_{h_1} \prod_{j=2}^{t} [T]_{h_j, h_{j-1}} \prod_{j=1}^{t} [\lambda(x_j)]_{h_j}$$

though we will not consider this particular form of the likelihood. Instead, we will look at a “new” form for the likelihood,

$$\Pr(x_t, \ldots, x_1) = 1^{\top} A(x_t) \cdots A(x_1)\, \pi$$

where $\lambda(x)$ is the distribution of the observation given a hidden state, and

$$A(x) = T \operatorname{diag}(\lambda(x)).$$

Figure: A(x), graphically. With $\lambda(x) = \Pr(x \mid h)$, the first column of $A(x)$ is $\Pr(h_{t+1}, x \mid h_t = 1)$.
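The equivalence of the two likelihood forms can be checked numerically. The sketch below is illustrative only: the function names and the toy two-state, three-symbol HMM are assumptions, not taken from the talk. It stores $T$ with columns indexed by the current state and `lam[x]` as the vector $\lambda(x)$ over hidden states.

```python
import itertools
import numpy as np

def likelihood_path_sum(obs, T, lam, pi):
    """Brute-force likelihood: sum over every hidden path h_1..h_t."""
    k = len(pi)
    total = 0.0
    for path in itertools.product(range(k), repeat=len(obs)):
        p = pi[path[0]]
        for j in range(1, len(obs)):
            p *= T[path[j], path[j - 1]]      # [T]_{h_j, h_{j-1}}
        for j, x in enumerate(obs):
            p *= lam[x, path[j]]              # [lambda(x_j)]_{h_j}
        total += p
    return total

def likelihood_operator(obs, T, lam, pi):
    """Operator form: 1^T A(x_t) ... A(x_1) pi with A(x) = T diag(lambda(x))."""
    v = pi.copy()
    for x in obs:                             # obs is in time order x_1, ..., x_t
        v = T @ (lam[x] * v)                  # same as T @ diag(lam[x]) @ v
    return float(v.sum())                     # left-multiply by the all-ones vector

# Hypothetical toy model: columns of T and lam are probability vectors.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
lam = np.array([[0.5, 0.1],
                [0.3, 0.3],
                [0.2, 0.6]])
pi = np.array([0.6, 0.4])
obs = [0, 2, 1]

print(likelihood_path_sum(obs, T, lam, pi), likelihood_operator(obs, T, lam, pi))
```

The path sum costs $k^t$ terms, while the operator form needs only $t$ matrix-vector products, which is why the second form is the one worth estimating.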

Spectral Methods for Estimation

Spectral methods use singular value decomposition (SVD) and the method of moments.

Fast SVD instead of EM estimation with the forward/backward method.

Computing observables for spectral estimation of an HMM, fully reduced third moment.

Estimation speed is critical given the size of high-frequency financial datasets.

For US equities, sampling per second yields roughly 5 million data points per year per stock!
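As a back-of-the-envelope check on that order of magnitude, the sketch below assumes a 6.5-hour regular trading session and 252 trading days per year. Both figures are assumptions for illustration and are not stated in the talk.

```python
SESSION_HOURS = 6.5    # assumed regular US equity session length
TRADING_DAYS = 252     # assumed number of trading days per year

seconds_per_day = int(SESSION_HOURS * 3600)
samples_per_year = seconds_per_day * TRADING_DAYS
print(samples_per_year)  # 5896800
```

Under these assumptions the count is a little under 6 million per stock per year, the same order as the figure quoted above.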

Spectral Algorithm Sketch

Calculate $E[X_2 \otimes X_1]$.

Calculate fast SVD of $E[X_2 \otimes X_1]$ keeping $k$ left singular vectors.

Reduce the data where $y = U^{\top} x$.

Compute the first three moments $E[Y_1]$, $E[Y_2 \otimes Y_1]$, $E[Y_3 \otimes Y_1 \otimes Y_2]$.

In the discrete case,

$$\Pr(x_t, \ldots, x_1) = b_\infty^{\top} B(y_t) \cdots B(y_1)\, b_1$$

where $B(y)$ is the similarity transform of $A(x)$.
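The sketch above can be exercised end to end on a toy discrete HMM. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses exact population moments computed from known parameters instead of sample averages, a plain `numpy.linalg.svd` instead of a randomized fast SVD, and indicator-vector observations so that `M[x, i]` is $\Pr(x \mid h = i)$. The helper names `make_hmm` and `spectral_observables` are hypothetical.

```python
import numpy as np

def random_stochastic(rows, cols, rng):
    """Random matrix whose columns are probability vectors."""
    m = rng.random((rows, cols)) + 0.1
    return m / m.sum(axis=0, keepdims=True)

def make_hmm(k=3, n=5, seed=0):
    rng = np.random.default_rng(seed)
    # Mixing with the identity keeps T column-stochastic and invertible.
    T = 0.7 * np.eye(k) + 0.3 * random_stochastic(k, k, rng)
    M = random_stochastic(n, k, rng)           # M[x, i] = Pr(x | h = i)
    pi = random_stochastic(k, 1, rng).ravel()  # initial state distribution
    return T, M, pi

def spectral_observables(T, M, pi, k):
    P21 = M @ T @ np.diag(pi) @ M.T            # E[X2 (x) X1]
    U = np.linalg.svd(P21)[0][:, :k]           # top-k left singular vectors
    S = U.T @ M                                # reduced observation matrix U^T M
    mu = S @ pi                                # E[Y1]
    Sigma = U.T @ P21 @ U                      # E[Y2 (x) Y1]
    Sigma_inv = np.linalg.inv(Sigma)
    B = []
    for x in range(M.shape[0]):
        # Slice of the reduced third moment at X2 = x.
        C_x = S @ T @ np.diag(M[x]) @ T @ np.diag(pi) @ S.T
        B.append(C_x @ Sigma_inv)
    b1 = mu
    b_inf = np.linalg.solve(Sigma.T, mu)       # b_inf^T = mu^T Sigma^{-1}
    return b1, b_inf, B

def likelihood_spectral(obs, b1, b_inf, B):
    v = b1
    for x in obs:                              # time order x_1, ..., x_t
        v = B[x] @ v
    return float(b_inf @ v)

def likelihood_exact(obs, T, M, pi):
    v = pi
    for x in obs:
        v = T @ (M[x] * v)                     # A(x) v with A(x) = T diag(lambda(x))
    return float(v.sum())

T, M, pi = make_hmm()
b1, b_inf, B = spectral_observables(T, M, pi, k=3)
obs = [0, 3, 1, 4, 2]
print(likelihood_spectral(obs, b1, b_inf, B), likelihood_exact(obs, T, M, pi))
```

With real data the three moments would be replaced by empirical averages over consecutive observation triples, and the two likelihoods would agree only approximately.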

Generalization to the Continuous Case

To generalize to the continuous case we need to take expectations, where

$$\Pr(x_t, \ldots, x_1) = b_\infty^{\top} B(G(x_t)) \cdots B(G(x_1))\, b_1$$

and $G(x)$ is an estimate of $E[Y_2 \mid x_1]$.

$B(G(x))$ is exactly what we want, up to a constant factor depending on $x$.

Factorial HMM

Different state layers evolve differently

Figure: Factorial HMM diagram

Factorial HMM

Figure: Structured Factorial HMM diagram

Structured Factorial HMM Differences

Improvements

Faster estimation using Spectral methods

Intuition about time horizon

Simple layer aggregation

Drawbacks

Jumps in covariance estimation at hourly boundaries

Heuristic choice of time horizon

Requires lots of data

Stock Covariance

Model      Horizon   RMSE           N training    N out of sample
CAPM       daily     0.9 × 10^-5    3125          1000
CAPM       hourly    1.2 × 10^-7    ~40000        ~4000
CAPM       second    1.7 × 10^-8    ~4000000      ~400000
PCA (1)    daily     0.85 × 10^-5   3124          1000
PCA (1)    hourly    1.0 × 10^-7    ~40000        ~4000
PCA (1)    second    1.6 × 10^-8    ~4000000      ~400000
GARCH      daily     0.6 × 10^-5    3124          1000
GARCH      hourly    0.9 × 10^-7    ~40000        ~4000
GARCH      second    1.2 × 10^-8    ~4000000      ~400000
FHMM       daily     1.2 × 10^-6    3124          1000
FHMM       hourly    3.0 × 10^-7    ~40000        ~4000
FHMM       second    0.9 × 10^-9    ~4000000      ~400000

(1) 15 principal components

Summary

Major Contributions

Multiple time frames

Richer model

Intuitive explanation of model

Fast estimation

Thanks for listening!

Future Work

Empirical frequency selection

Expansion to other datasets (energy / weather)

Better estimation on lower time horizons

Test more distributions for G(x)

For Further Reading I

Spectral Algorithm for Learning Hidden Markov Models, Hsu, Kakade, Zhang 2009

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, Halko, Martinsson, Tropp 2011

Using Regression for Spectral Estimation, Foster, Rodu, Ungar, Wu 2013

Two Step CCA: A new spectral method for estimating vector models of words, Dhillon, Foster, Rodu, Ungar 2013

Spectral Dependency Parsing with Latent Variables, Collins, Dhillon, Foster, Rodu, Ungar 2012

Spectral Dimensionality Reduction for HMMs, Foster, Rodu, Ungar 2012

Papers and Projects In Progress

Spectral Estimation of HMMs with a continuous output distribution, Foster, Rodu, Ungar (in progress)

Spectral Estimation of hierarchical HMMs, Foster, Rodu, Sedoc, Ungar (in progress)

Appendix

In this section we describe how to build the observables $B(x)$. First note that the first three moments of the data from an HMM have the following theoretical form:

$$E[X_1] = M\pi$$

$$E[X_2 \otimes X_1] = M\, T \operatorname{diag}(\pi)\, M^{\top}$$

$$E[X_3 \otimes X_1 \otimes X_2] = M\, T \operatorname{diag}(\lambda(x))\, T \operatorname{diag}(\pi)\, M^{\top}$$

where, in this particular setting, $X_1$ is $\Pr(\Sigma_{t-1})$, $X_2$ is $\Pr(\Sigma_t)$, $X_3$ is $\Pr(\Sigma_{t+1})$, $\pi$ is the initial state vector, and $M$ is the matrix whose column $i$ is the expected value of $x$ given hidden state $i$.

Spectral Algorithm Sketch

Calculate E [X2 ⊗ X1].

Calculate fast SVD of E [X2 ⊗ X1] keeping k left singular vectors.

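Reduce the data where $y = U^{\top} x$.

Compute the first three moments $E[Y_1]$, $E[Y_2 \otimes Y_1]$, $E[Y_3 \otimes Y_1 \otimes Y_2]$.

Consider a $U$ such that $U^{\top} M$ is invertible. Estimating the second and third moments with the reduced data $y = U^{\top} x$ then gives, in the discrete case,

$$B(x) \equiv E[Y_3 \otimes Y_1 \otimes Y_2](\lambda(x))\; E[Y_2 \otimes Y_1]^{-1} = (U^{\top} M)\, T \operatorname{diag}(\lambda(x))\, (U^{\top} M)^{-1}.$$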

Generalization to the Continuous Case

To generalize to the continuous case we need to take expectations, where

$$B(G(x)) = (U^{\top} M)\, T \operatorname{diag}(\lambda(x))\, (U^{\top} M)^{-1}\, \frac{1}{\Pr(x)}$$

where $\Pr(x)$ is the marginal probability, and $G(x)$ is a function of $E[Y_2 \mid x_1]$. $B(G(x))$ is exactly what we want, up to a constant factor depending on $x$, as

$$\Pr(Y_1, \ldots, Y_t) \equiv b_\infty^{\top}\, B(G(x_t)) \cdots B(G(x_1))\, b_1.$$

Continuous Emission HMM

Define $g(x) \equiv E[Y_2 \mid x_1]$. Let $h_t$ be the probability vector associated with being in a particular state at time $t$. Then

$$E[y_2 \mid h_2] = U^{\top} M h_2,$$

$$E[h_2 \mid h_1] = T h_1,$$

$$E[y_2 \mid h_1] = U^{\top} M\, T h_1.$$

Continuous HMM Emission

To establish a belief about $h_1$ given $x_1$, recall from Bayes' formula

$$\Pr(h_1 \mid x_1) = \frac{\Pr(x_1 \mid h_1)\Pr(h_1)}{\Pr(x_1)}$$

We can arrange each probability into a vector, and because in the indicator vector case the probability vector is the same as the expected value vector, we have, in vector notation,

$$E[h_1 \mid x_1] = \frac{\operatorname{diag}(\pi)\,\lambda(x)}{\pi^{\top}\lambda(x)}$$

and so, putting together the pieces, we get

$$E[y_2 \mid x_1] = \frac{U^{\top} M\, T \operatorname{diag}(\pi)\,\lambda(x)}{\pi^{\top}\lambda(x)}$$
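The posterior step can be illustrated with a small numeric sketch. The two-state model and the Gaussian emission densities below are assumptions chosen for illustration (the talk does not specify the emission family), and the helper names are hypothetical.

```python
import numpy as np

def gaussian_density(x, means, sds):
    """lambda(x): density of observation x under each hidden state (assumed Gaussian)."""
    return np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2.0 * np.pi))

def posterior_state(pi, lam_x):
    """E[h1 | x1] = diag(pi) lambda(x) / (pi^T lambda(x))."""
    joint = pi * lam_x            # diag(pi) lambda(x)
    return joint / joint.sum()    # the normaliser is pi^T lambda(x) = Pr(x)

# Hypothetical two-state model.
pi = np.array([0.5, 0.5])
means = np.array([0.0, 3.0])
sds = np.array([1.0, 1.0])
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])        # column j holds Pr(h2 | h1 = j)

post = posterior_state(pi, gaussian_density(0.0, means, sds))
next_state = T @ post             # E[h2 | x1] = T E[h1 | x1]
print(post, next_state)
```

Multiplying `next_state` by $U^{\top} M$ would give $E[y_2 \mid x_1]$ as in the last display above.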

Continuous HMM Emission

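Recall that the goal is to isolate $\lambda(x)$. Note that

$$E[y_2 \otimes y_1]^{-1} g(x) = \frac{(M^{\top} U)^{-1}\lambda(x)}{\pi^{\top}\lambda(x)} \equiv G(x)$$

When this is plugged into our fully reduced version of $B(\gamma)$, we get

$$B(G(x)) = (U^{\top} M)\, T \operatorname{diag}\!\big(M^{\top} U\, G(x)\big)\, (U^{\top} M)^{-1} = (U^{\top} M)\, T \operatorname{diag}(\lambda(x))\, (U^{\top} M)^{-1}\, \frac{1}{\Pr(x)}$$

where $\Pr(x)$ is the marginal probability. $B(G(x))$ is exactly what we want, up to a constant factor depending on $x$.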

Spectral Estimation Algorithm

Algorithm 1 Computing observables for spectral estimation of an HMM, fully reduced third moment

1: Input: Training examples $x^{(i)}$ for $i \in \{1, \ldots, M\}$, where $x^{(i)} = x_1^{(i)}, x_2^{(i)}, x_3^{(i)}$.

2: Compute $E[x_2 \otimes x_1] = \frac{1}{M} \sum_{i=1}^{M} x_2^{(i)} x_1^{(i)\top}$.

3: Compute the left $k$ eigenvectors corresponding to the top $k$ eigenvalues of $\Sigma$. Call the matrix of these eigenvectors $U$.

4: Reduce data: $y = U^{\top} x$.

5: Compute $\mu = \frac{1}{M} \sum_{i=1}^{M} y_1^{(i)}$, $\Sigma = \frac{1}{M} \sum_{i=1}^{M} y_2^{(i)} y_1^{(i)\top}$ and tensor $C = \frac{1}{M} \sum_{i=1}^{M} y_3^{(i)} \otimes y_1^{(i)} \otimes y_2^{(i)}$.

6: Set $b_1 = \mu$ and $b_\infty^{\top} = b_1^{\top} \Sigma^{-1}$.

7: Right multiply each slice of the tensor in the $y_2$ direction (so $y_2$ is being sliced up, leaving the $y_3 y_1^{\top}$ matrices intact) by $\Sigma^{-1}$ to form $B(\gamma) = C(\gamma)\Sigma^{-1}$.

Similarity Transform from A(x) to B(x)

Unfortunately, $A(x)$ isn’t directly learnable. However, an appropriate similarity transformation of $A(x)$ (of which there is more than one) is learnable by the method of moments, bypassing the need to recover the HMM parameters, and still gets us what we want. Note that

$$P(x_1, \ldots, x_t) = 1^{\top} A(x_t) \cdots A(x_1)\, \pi$$

$$= \underbrace{1^{\top} S^{-1}}_{b_\infty^{\top}} \; \underbrace{S A(x_t) S^{-1}}_{B(x_t)} \; S \cdots S^{-1} \; \underbrace{S A(x_1) S^{-1}}_{B(x_1)} \; \underbrace{S \pi}_{b_1}$$

$$\equiv b_\infty^{\top}\, B(x_t) \cdots B(x_1)\, b_1$$
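The telescoping of $S^{-1} S$ can be confirmed numerically for an arbitrary invertible $S$. The sketch below uses a randomly generated toy HMM and a diagonally dominant random $S$. All parameter values and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 4   # hidden states, observation symbols (toy sizes)

def col_stochastic(rows, cols):
    m = rng.random((rows, cols)) + 0.1
    return m / m.sum(axis=0, keepdims=True)

T = col_stochastic(k, k)             # T[i, j] = Pr(h' = i | h = j)
lam = col_stochastic(n, k)           # lam[x] is lambda(x) over hidden states
pi = col_stochastic(k, 1).ravel()

def A(x):
    return T @ np.diag(lam[x])

# Any invertible S works; strict diagonal dominance guarantees invertibility.
S = rng.random((k, k)) + k * np.eye(k)
S_inv = np.linalg.inv(S)

def B(x):
    return S @ A(x) @ S_inv

b1 = S @ pi
b_inf = np.ones(k) @ S_inv

def chain(first, operators, last):
    v = first
    for op in operators:             # apply in time order x_1, ..., x_t
        v = op @ v
    return float(last @ v)

obs = [2, 0, 3, 1]
p_A = chain(pi, [A(x) for x in obs], np.ones(k))
p_B = chain(b1, [B(x) for x in obs], b_inf)
print(p_A, p_B)
```

The spectral method never forms $S$ explicitly; in the appendix construction it is implicitly $U^{\top} M$.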

Markowitz Optimization

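Given a vector of current prices $p_t$ and unknown future prices $P_{t+\tau}$, the market value is

$$\Psi_\alpha = \alpha_{\mathrm{pos}}^{\top} (P_{t+\tau} - p_t) \qquad (1)$$

Assuming that the market is Gaussian, the price distribution is

$$P_{t+\tau} \sim N(\mu, \Sigma) \qquad (2)$$

Therefore the distribution of the portfolio is

$$\Psi_\alpha \sim N\!\left(\alpha_{\mathrm{pos}}^{\top} (\mu - p_t),\; \alpha_{\mathrm{pos}}^{\top} \Sigma\, \alpha_{\mathrm{pos}}\right) \qquad (3)$$

The allocation is optimized under exponential utility with risk-aversion parameter $\zeta$, and the certainty equivalent is maximized by the quadratic program (QP)

$$\operatorname*{argmax}_{\alpha_{\mathrm{pos}}} \; CE(\alpha_{\mathrm{pos}}) = P^{\top}\alpha_{\mathrm{pos}} - \frac{1}{2}\,\zeta\,\alpha_{\mathrm{pos}}^{\top}\,\Sigma\,\alpha_{\mathrm{pos}} \qquad (4)$$

where $P$ is the expected profit, roughly defined as $P_t = E[P_{t+\tau} \mid p_t]$. Because numeric optimizers seek to minimize, define the objective function as $f(\alpha) \equiv -CE(\alpha)$.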
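Because the unconstrained certainty equivalent is a concave quadratic, its maximizer has a closed form, which gives a quick sanity check on any numeric optimizer applied to $f(\alpha)$. In the sketch below, `c` stands for whatever coefficient multiplies the quadratic risk term in the certainty equivalent; the function names and all numeric values are hypothetical.

```python
import numpy as np

def certainty_equivalent(alpha, P, Sigma, c):
    """CE(alpha) = P^T alpha - c * alpha^T Sigma alpha."""
    return float(P @ alpha - c * alpha @ Sigma @ alpha)

def objective(alpha, P, Sigma, c):
    """f(alpha) = -CE(alpha), the quantity a minimizer would be given."""
    return -certainty_equivalent(alpha, P, Sigma, c)

def optimal_allocation(P, Sigma, c):
    """Solve the first-order condition P - 2 c Sigma alpha = 0."""
    return np.linalg.solve(2.0 * c * Sigma, P)

# Hypothetical inputs: expected profit per asset and a positive definite covariance.
P = np.array([0.02, 0.01, 0.015])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
c = 2.5

alpha_star = optimal_allocation(P, Sigma, c)
grad = -P + 2.0 * c * Sigma @ alpha_star   # gradient of f at the optimum
print(alpha_star, objective(alpha_star, P, Sigma, c))
```

The solution depends on $\Sigma$ only through $\Sigma^{-1} P$, so errors in the covariance estimate feed directly into the allocation, which is the motivation for replacing $\Sigma$ with a better time-varying estimate $\Sigma_t$.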
