Combining multivariate Markov chains


Page 1: Combining multivariate Markov chains

Combining multivariate Markov chains

Jesús E. García

Page 2: Combining multivariate Markov chains

In this paper we address the problem of modelling multivariate finite order Markov chains when the dataset is not large enough to apply the usual methodology.

Page 3: Combining multivariate Markov chains

The curse of dimensionality in model selection for multivariate stochastic processes

- Consider the alphabet B = {1, 2, ..., |B|}.

- For an order 1 Markov chain on B the set of parameters is the set of transition probabilities {p(s, b) : b, s ∈ B}, that is |B|^2 parameters; considering that ∑_b p(s, b) = 1, the number of parameters to estimate is |B|(|B| − 1).

- For an order o Markov chain over the alphabet B, the set of transition probabilities is {p(s, b) : s ∈ B^o, b ∈ B}, corresponding to |B|^o (|B| − 1) parameters.

- For an order o, k-variate Markov chain over the alphabet A = B^k, the set of transition probabilities is {p(s, a) : s ∈ (B^k)^o, a ∈ B^k}, corresponding to |B|^{ok} (|B|^k − 1) parameters.
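As a sanity check, the parameter counts above can be computed directly. The following Python sketch is only illustrative; the function names are ours, not part of the slides:

```python
def n_params_univariate(B: int, o: int) -> int:
    """Free parameters of an order-o Markov chain on an alphabet of size B:
    |B|^o states, each with |B| - 1 free transition probabilities."""
    return B**o * (B - 1)

def n_params_multivariate(B: int, o: int, k: int) -> int:
    """Free parameters of an order-o, k-variate chain on A = B^k:
    |B|^{ok} states, each with |B|^k - 1 free transition probabilities."""
    return B**(o * k) * (B**k - 1)

print(n_params_univariate(2, 1))       # order-1 binary chain: 2 parameters
print(n_params_multivariate(2, 2, 3))  # order-2, 3-variate binary chain: 448
```

Already for a binary alphabet with order 2 and three coordinates, the full model needs 448 parameters, which makes the exponential growth concrete.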

Page 4: Combining multivariate Markov chains

The curse of dimensionality in model selection for multivariate stochastic processes

- For an order o, k-variate Markov chain over the alphabet B^k, we need to fit |B|^{ok} (|B|^k − 1) parameters.

- The number of parameters needed for a multivariate Markov chain grows exponentially with the process order and the dimension of the chain's alphabet.

- The size of the dataset needed to fit a multivariate Markov chain grows exponentially with the process order and the dimension of the chain's alphabet.

- Given a dataset, the set of multivariate Markov chain models from which to choose a model for the dataset is limited by the curse of dimensionality.

Page 5: Combining multivariate Markov chains

The curse of dimensionality in model selection for multivariate stochastic processes

- In general, during model selection, we cannot reduce the dimension of the alphabet.

- When the dataset is not large enough for the number of parameters of the "true" model, the order of the fitted model will be smaller than the order of the "true" model from which the sample was produced.

- In this work we introduce a new strategy to estimate a model for a multivariate process that allows the estimation of a model with greater order than the standard procedure.

- The family of models used for the model selection procedure is the family of partition Markov models (PMM).

Page 6: Combining multivariate Markov chains

Partition Markov models: notation

Let (Xt) be a discrete time, order o Markov chain, with

- A the finite alphabet;

- S = A^o the state space;

- P(a|s) = Prob(Xt = a | X_{t−o}^{t−1} = s), a ∈ A, s ∈ S, the transition probabilities.

Page 7: Combining multivariate Markov chains

Equivalence relationship on S

Definition
For s, r ∈ S: s ∼p r ⇐⇒ P(a|s) = P(a|r) ∀a ∈ A.

- For any s ∈ S, the equivalence class of s is given by [s] = {r ∈ S | r ∼p s}.

- The classes defined by ∼p are the subsets of S with the same transition probabilities.

Page 8: Combining multivariate Markov chains

Equivalence relationship on S

- The equivalence relation defines a partition L of S.

- We have (|A| − 1) transition probabilities for each "part" (element of L), obtaining a model with (|A| − 1)|L| parameters.

- The elements of S in the same equivalence class activate the same random mechanism to choose the next element in the Markov chain.

Page 9: Combining multivariate Markov chains

Markov chain with partition L

Definition
Let (Xt) be a discrete time, order o Markov chain on A and let L = {L1, L2, . . . , LK} be a partition of S. We will say that (Xt) is a Markov chain with partition L if this partition is the one defined by ∼p.

Page 10: Combining multivariate Markov chains

Example

- A = {0, 1}, o = 2;

- S (= A^o) = {00, 01, 10, 11};

- assume that P(0|00) = P(0|01) = 0.4 and P(0|10) = P(0|11) = 0.2;

- P(1|s) = 1 − P(0|s) ∀s ∈ S;

- the partition for this Markov chain is L = {{00, 01}, {10, 11}}, with parts L1 = {00, 01} and L2 = {10, 11};

- the parameters of the Markov chain with partition L are P(0|L1) = 0.4 and P(0|L2) = 0.2.
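The partition in this example can be recovered mechanically by grouping states with identical transition probabilities, i.e. by computing the classes of ∼p. A small illustrative Python sketch:

```python
from collections import defaultdict

# Transition probabilities P(0|s) from the example; P(1|s) = 1 - P(0|s).
P0 = {"00": 0.4, "01": 0.4, "10": 0.2, "11": 0.2}

# Group states that share the same transition probabilities (the relation ~p).
classes = defaultdict(list)
for s, p in P0.items():
    classes[p].append(s)

partition = [sorted(states) for states in classes.values()]
print(partition)  # [['00', '01'], ['10', '11']]
```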

Page 11: Combining multivariate Markov chains

Model selection problem

Given a sample generated by a finite memory stationary process, how to choose a partition defining a good Markov model for the source?

Page 12: Combining multivariate Markov chains

Notation

Let x_1^n be a sample of the process (Xt), s ∈ S, a ∈ A and n > o.

N_n(s, a) = |{t : o < t ≤ n, x_{t−o}^{t−1} = s, x_t = a}|,   (1)

N_n(s) = |{t : o < t ≤ n, x_{t−o}^{t−1} = s}|.   (2)

To simplify the notation we will omit the subscript n in N_n.

Page 13: Combining multivariate Markov chains

A distance in S

Definition
We define the distance d in S: for any s, r ∈ S,

d(s, r) = 2 / ((|A| − 1) ln(n)) ∑_{a∈A} { N(s, a) ln(N(s, a)/N(s)) + N(r, a) ln(N(r, a)/N(r)) − (N(s, a) + N(r, a)) ln((N(s, a) + N(r, a))/(N(s) + N(r))) }.
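The distance can be computed directly from the transition counts. The sketch below is our own illustration; in particular the convention x ln x → 0 for zero counts is an implementation choice, not stated on the slides:

```python
import math

def d(Ns: dict, Nr: dict, n: int, alphabet) -> float:
    """Distance between states s and r from their transition counts.
    Ns[a] = N(s, a), Nr[a] = N(r, a); n is the sample size.
    Zero counts contribute 0, using the convention x ln x -> 0."""
    def term(count, total):
        return count * math.log(count / total) if count > 0 else 0.0

    N_s, N_r = sum(Ns.values()), sum(Nr.values())
    acc = 0.0
    for a in alphabet:
        acc += term(Ns[a], N_s) + term(Nr[a], N_r)
        acc -= term(Ns[a] + Nr[a], N_s + N_r)
    return 2.0 * acc / ((len(alphabet) - 1) * math.log(n))

# States with identical empirical transition probabilities are at distance ~0,
# while very different transition counts give a strictly positive distance.
print(d({0: 40, 1: 60}, {0: 20, 1: 30}, n=300, alphabet=[0, 1]))
print(d({0: 90, 1: 10}, {0: 10, 1: 90}, n=200, alphabet=[0, 1]))
```

This matches item i. of the theorem that follows: d(r, s) = 0 exactly when the empirical transition probabilities coincide.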

Page 14: Combining multivariate Markov chains

A distance in S

Theorem
For any s, r, t ∈ S,

i. d(r, s) ≥ 0, with equality if and only if N(s, a)/N(s) = N(r, a)/N(r) ∀a ∈ A;

ii. d(r, s) = d(s, r);

iii. d(r, t) ≤ d(r, s) + d(s, t).

Remark
d can be generalized to subsets (see García, J. and González-López, V. A. (2010)).

Page 15: Combining multivariate Markov chains

Consistency in the case of a Markov source

Theorem
Let (Xt) be a discrete time, order o Markov chain on a finite alphabet A, let x_1^n be a sample of the process and 0 < α < ∞. Then, for n large enough and for each s, r ∈ S, d_n(r, s) < α if and only if s and r belong to the same class.

Page 16: Combining multivariate Markov chains

Algorithm
Input: d(s, r) ∀ s ≠ r ∈ S; Output: Ln.

B = S; Ln = ∅
while B ≠ ∅
    select s ∈ B
    define Ls = {s}
    B = B \ {s}
    for each r ∈ B, r ≠ s
        if d(s, r) < α
            Ls = Ls ∪ {r}
            B = B \ {r}
    Ln = Ln ∪ {Ls}
Return: Ln = {L1, L2, . . . , LK}
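The pseudocode above translates almost line by line into Python. In this sketch the distance is passed in as a function; the toy numeric "distance" in the usage line is only for illustration and is not the d of the previous slides:

```python
def build_partition(states, dist, alpha):
    """Greedy clustering from the slide: each not-yet-assigned state s opens
    a new part Ls, then absorbs every remaining state r with dist(s, r) < alpha."""
    remaining = list(states)   # B = S
    partition = []             # Ln = empty
    while remaining:
        s = remaining.pop(0)   # select s in B, B = B \ {s}
        part = [s]             # Ls = {s}
        kept = []
        for r in remaining:
            if dist(s, r) < alpha:
                part.append(r)     # Ls = Ls ∪ {r}, B = B \ {r}
            else:
                kept.append(r)
        remaining = kept
        partition.append(part)     # Ln = Ln ∪ {Ls}
    return partition

# Toy example: numbers as "states", absolute difference as "distance".
print(build_partition([1, 2, 10, 11], lambda s, r: abs(s - r), alpha=3))
# [[1, 2], [10, 11]]
```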

Page 17: Combining multivariate Markov chains

Notation for the k-variate process

Let Xt be the state of the k-dimensional Markov process at time t, with

- A = B^k;

- Xt = (X(1)_t, ..., X(k)_t) will be the value of the k-dimensional source at time t;

- Xt ∈ A;

- X(i)_t ∈ B is the value of coordinate number i at time t.

Page 18: Combining multivariate Markov chains

k-variate copula

- B = {1, 2, ..., |B|} a finite set.

- For 1 ≤ i ≤ k, X(i) is a random variable taking values on the set B.

- (X(1), ..., X(k)) is a random vector with joint probability function p(b1, ..., bk) = P(X(1) = b1, ..., X(k) = bk), b1, ..., bk ∈ B.

- For 1 ≤ i ≤ k, X(i) has probability function p_i(b) = P(X(i) = b) and cumulative distribution F_i(b) = P(X(i) ≤ b).

- For u = (u1, ..., uk), with 0 ≤ u_i ≤ 1 for i = 1, 2, ..., k, the k-variate copula density is given by

  c(u1, . . . , uk) = p(b1, ..., bk) / (p1(b1) · · · pk(bk)),

  where b_i is chosen such that F_i(b_i − 1) ≤ u_i ≤ F_i(b_i) for 1 ≤ i ≤ k, and F_i(0) = 0.
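For intuition, the discrete copula density can be evaluated numerically: for each coordinate pick the smallest symbol b_i whose CDF value reaches u_i, then form the ratio above. A hedged Python sketch (the data structures are our own choice, not part of the slides):

```python
def copula_density(p_joint, marginals, u):
    """Discrete copula density c(u) = p(b1,...,bk) / (p1(b1)...pk(bk)),
    where b_i is the smallest symbol whose CDF value reaches u_i.
    p_joint: dict (b1,...,bk) -> probability; marginals: list of dicts b -> p_i(b)."""
    b = []
    for u_i, p_i in zip(u, marginals):
        cum = 0.0
        for sym in sorted(p_i):      # accumulate F_i symbol by symbol
            cum += p_i[sym]
            if u_i <= cum:
                b.append(sym)
                break
    b = tuple(b)
    denom = 1.0
    for b_i, p_i in zip(b, marginals):
        denom *= p_i[b_i]
    return p_joint.get(b, 0.0) / denom

# Independent coordinates: the copula density is identically 1.
p = {(1, 1): 0.25, (1, 2): 0.25, (2, 1): 0.25, (2, 2): 0.25}
m = [{1: 0.5, 2: 0.5}, {1: 0.5, 2: 0.5}]
print(copula_density(p, m, (0.3, 0.8)))  # 1.0
```

With a dependent joint law, e.g. p(1,1) = p(2,2) = 0.4 and p(1,2) = p(2,1) = 0.1 with the same uniform marginals, the density rises above 1 on the diagonal cells.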

Page 19: Combining multivariate Markov chains

k-variate copula

The function c(u1, . . . , uk) satisfies the following properties:

- it is a probability mass function, supported on [0, 1]^k;

- the univariate marginal distributions are U(0, 1); and

- the cumulative distribution C of c verifies Prob(X(1) ≤ b1, . . . , X(k) ≤ bk) = C(F1(b1), . . . , Fk(bk)), for all (b1, . . . , bk) ∈ B^k.

Page 20: Combining multivariate Markov chains

Example: 2-variate copula

Consider k = 2, B = {1, 2}; the copula density is

c(u, v) =
  p(1,1) / (p1(1) p2(1)),  if (u, v) ∈ [0, F1(1)) × [0, F2(1))
  p(1,2) / (p1(1) p2(2)),  if (u, v) ∈ [0, F1(1)) × [F2(1), 1]
  p(2,1) / (p1(2) p2(1)),  if (u, v) ∈ [F1(1), 1] × [0, F2(1))
  p(2,2) / (p1(2) p2(2)),  if (u, v) ∈ [F1(1), 1] × [F2(1), 1]
  0,  otherwise.

Page 21: Combining multivariate Markov chains

Mixed Markov Partitions procedure

- We will assume that, for 1 ≤ i ≤ k, X(i)_t is an order om Markov chain, with om < ∞.

- The marginal state space is B^om.

- For each s ∈ B^om and b ∈ B, Pi(b|s) = Prob(X(i)_t = b | X(i)_{t−om}^{t−1} = s).

Page 22: Combining multivariate Markov chains

Mixed Markov Partitions procedure, first part

In the first part of our procedure we fit a model for the multivariate process:

- Divide the dataset into two parts.

- Use the first half of the dataset to fit a PMM to the multivariate process Xt with a maximum order equal to oc.

- Call Loc the partition of A^oc corresponding to the fitted model.

- Extend the partition Loc to a partition of A^om, denoted by Pc, in the following way: if Loc = {Loc_1, ..., Loc_mc}, then Pc = {Lc_1, ..., Lc_mc}, with

  Lc_j = ∪_{s ∈ Loc_j} {w.s : w ∈ A^(om−oc)}, 1 ≤ j ≤ mc,

  where "." denotes the concatenation of strings.
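The extension of Loc to A^om amounts to prefixing every state in each part with all strings w of length om − oc. A minimal sketch, representing states as plain strings and "." as ordinary concatenation:

```python
from itertools import product

def extend_partition(L_oc, alphabet, om, oc):
    """Extend a partition of A^oc to a partition of A^om: each state s in a
    part gains every possible prefix w of length om - oc."""
    prefixes = ["".join(w) for w in product(alphabet, repeat=om - oc)]
    return [{w + s for s in part for w in prefixes} for part in L_oc]

# Order-1 partition {{"0"}, {"1"}} of {0, 1} extended to order om = 2:
# the parts become {"00", "10"} and {"01", "11"}.
print(extend_partition([{"0"}, {"1"}], ["0", "1"], om=2, oc=1))
```

Note that the extended parts group states by their oc-suffix, which is exactly what makes the fitted low-order probabilities reusable at order om.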

Page 23: Combining multivariate Markov chains

Mixed Markov Partitions procedure, second part

In the second part of our procedure we fit a model for each marginal process, using the second half of the dataset:

- Divide the second half of the dataset into k independent subsets of equal length.

- Fit a PMM to each marginal process using the corresponding subset of the dataset.

- For i = 1, 2, ..., k let Li = {Li_1, ..., Li_mi} be the partition of B^om corresponding to the model fitted to the marginal process X(i)_t.

- From the collection of partitions {L1, ..., Lk} define the following partition of A^om:

  Pm = {L1_j1 × ... × Lk_jk : 1 ≤ j1 ≤ m1, ..., 1 ≤ jk ≤ mk}.
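The product partition Pm can be built with one Cartesian product per choice of marginal parts. A sketch, representing a joint state as a tuple of per-coordinate strings (this representation is our own choice):

```python
from itertools import product

def product_partition(marginal_partitions):
    """Combine the partitions L1, ..., Lk of the marginal state spaces into a
    partition of the joint space: one part per choice of one part per coordinate."""
    return [set(product(*choice)) for choice in product(*marginal_partitions)]

# k = 2: the first coordinate distinguishes "0" from "1", the second does not.
L1 = [{"0"}, {"1"}]
L2 = [{"0", "1"}]
Pm = product_partition([L1, L2])
print(len(Pm))  # 2 parts: {('0','0'), ('0','1')} and {('1','0'), ('1','1')}
```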

Page 24: Combining multivariate Markov chains

Mixed Markov Partitions procedure, third part

In the third part of our procedure we build a new partition from the joint and marginal partitions fitted in the first and second parts of the procedure.

- Build the partition P of A^om by combining Pc and Pm: P is the common refinement of the partitions Pm and Pc, corresponding to the following equivalence relationship in A^om:

  s ∼ r if ∃ L ∈ Pm and ∃ L′ ∈ Pc such that s, r ∈ L ∩ L′.

- Two states s and r belong to the same part of P if and only if they belong to the same part of both Pm and Pc.
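The common refinement intersects every part of Pm with every part of Pc and keeps the non-empty intersections. A minimal sketch:

```python
def refine(Pm, Pc):
    """Common refinement of two partitions (given as lists of sets):
    s ~ r iff s and r share a part in Pm AND a part in Pc."""
    return [L & Lp for L in Pm for Lp in Pc if L & Lp]

# Pm and Pc cut the state space along different lines; their refinement
# separates all four states.
Pm = [{"00", "01"}, {"10", "11"}]
Pc = [{"00", "10"}, {"01", "11"}]
print(refine(Pm, Pc))  # [{'00'}, {'01'}, {'10'}, {'11'}]
```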

Page 25: Combining multivariate Markov chains

Transition probabilities estimation using copula theory

Given s ∈ A^om and a ∈ A, we will show how to compute P(a|s).

- Let w be the size-oc suffix of s; that means s = q.w for an appropriate string q.

- Consider the estimator Pc(a|w) of the joint transition probability for the process of order oc, obtained from the first part of our procedure.

- For 1 ≤ i ≤ k, let s(i) be the sequence in B^om consisting of the concatenation of the elements of s in coordinate i.

- Denote by Pi(a(i)|s(i)) the estimate of the marginal probability for the i-th process of order om, obtained from the second part of our procedure, where a(i) is the i-th coordinate of a.

- Denote by Fi(a(i)|s(i)) the corresponding marginal distribution function.

Page 26: Combining multivariate Markov chains

Transition probabilities estimation using copula theory

The two sets of probabilities are combined in the following way:

- Define a k-dimensional copula distribution C((u1, ..., uk)|w) from the joint probabilities Pc(a|w), following, for example, the idea presented in [2]. There is more than one way of choosing a copula distribution in the case of discrete random variables; see [13].

- Evaluate the copula distribution at the marginal distributions, as follows:

  P(a|s) = C((F1(a(1)|s(1)), ..., Fk(a(k)|s(k))) | w).

- It is easy to check that for any L ∈ P, if s, r ∈ L then P(a|s) = P(a|r) ∀a ∈ A.

- In the approach proposed in this paper, the number of parameters to estimate is (|A| − 1)|Loc| + ∑_{i=1}^{k} (|B| − 1)|Li|, which gives a notion of the computational complexity of the procedure.

Page 27: Combining multivariate Markov chains

Conclusions

- In this paper, we show a procedure to fit an approximate PMM to a multivariate process when the dataset is not large enough to apply the usual methodology.

- The procedure combines individual PMMs for the marginal processes and a PMM for the joint process, the latter with an order smaller than the real one.

- We show how to combine the corresponding partitions into a unique partition, and how to combine the probabilities of all the models into a unique set of transition probabilities for the fitted model.

- This methodology can be modified to be used with other families of Markov models, such as the fixed and variable length Markov chains, for which there exist several model selection methods (see [14], [1] and [4]).

Page 28: Combining multivariate Markov chains

Acknowledgments

This article was produced as part of the activities of the FAPESP Center for Neuromathematics (grant 2013/07699-0, S. Paulo Research Foundation). The authors acknowledge the support provided by the USP project "Mathematics, computation, language and the brain" and the FAPESP project "Portuguese in time and space: linguistic contact, grammars in competition and parametric change".

Page 29: Combining multivariate Markov chains

Bibliography

[1] Csiszar, I. and Talata, Z., "Context tree estimation for not necessarily finite memory processes, via BIC and MDL", IEEE Trans. Inform. Theory 52, 1007–1016, 2006.

[2] Fernandez, M. and Gonzalez-Lopez, V. A., "A Bayesian approach for convex combination of two Gumbel-Barnett copulas", AIP Conference Proceedings 1558, 1491–1494, 2013.

[3] Fernandez, M. and Gonzalez-Lopez, V. A., "A copula model to analyze minimum admission scores", AIP Conference Proceedings 1558, 1479–1482, 2013.

[4] Galves, A., Galves, C., Garcia, J. E., Garcia, N. L. and Leonardi, F., "Context tree selection and linguistic rhythm retrieval from written texts", Annals of Applied Statistics 6(1), 186–209, 2012.

[5] Garcia, J. E. and Fernandez, M., "Copula based model correction for bivariate Bernoulli financial series", AIP Conference Proceedings 1558, 1487–1490, 2013.

[6] Garcia, J. and Gonzalez-Lopez, V. A., "Independence tests for continuous random variables based on the longest increasing subsequence", Journal of Multivariate Analysis 127, 126–146, 2014.

[7] Garcia, J. and Gonzalez-Lopez, V. A., "Minimal Markov Models", arXiv:1002.0729v1, 2010.

[8] Garcia, J. and Gonzalez-Lopez, V. A., "Detecting Regime Changes In Markov Models", Proceedings of The Sixth Workshop on Information Theoretic Methods in Science and Engineering, 2013.

[9] Garcia, J. and Gonzalez-Lopez, V. A., "Modeling of acoustic signal energies with a generalized Frank copula. A linguistic conjecture is reviewed", Communications in Statistics - Theory and Methods 43(10-12), 2034–2044, 2013.

[10] Garcia, J. E., Gonzalez-Lopez, V. A. and Nelsen, R. B., "A new index to measure positive dependence in trivariate distributions", Journal of Multivariate Analysis 115, 481–495, 2013.

[11] Garcia, J. E., Gonzalez-Lopez, V. A. and Viola, M. L. L., "Robust model selection and the statistical classification of languages", AIP Conference Proceedings 1490, 160–170, 2012.

[12] Garcia, J. E., Gonzalez-Lopez, V. A. and Viola, M. L. L., "Robust Model Selection for Stochastic Processes", Communications in Statistics - Theory and Methods 43(10-12), 2516–2526, 2014.

[13] Joe, H., Multivariate Models and Dependence Concepts, CRC Press, 1997.

[14] Rissanen, J., "A universal data compression system", IEEE Trans. Inform. Theory 29(5), 656–664, 1983.