Analysis of biological sequences using Markov Chains and Hidden Markov Models Bernard PRUM, La...

Analysis of biological sequences using Markov Chains and Hidden Markov Models Bernard PRUM, La genopole Evry France [email protected] Colloque T.A.G – LAPTH Annecy – 8-10 novembre 2006


Page 1:

Analysis of biological sequences using Markov Chains and Hidden Markov Models

Bernard PRUM, La genopole – Evry – France

[email protected]

Colloque T.A.G – LAPTH, Annecy – 8-10 November 2006

Page 2:

Why Markov Models ?

A biological sequence :

$X = (X_1, X_2, \ldots, X_n)$

where $X_k \in \mathcal{A} = \{t, c, a, g\}$ (nucleotides) or $\{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y\}$ (amino acids)

A very common tool for analyzing these sequences is the Markov Model (MM)

$P(X_k = v \mid X_j, j < k) = P(X_k = v \mid X_{k-1}) \quad \forall u, v \in \mathcal{A}$

denoted by $\pi(u, v)$ if $X_{k-1} = u$
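As an illustration of the definition above, here is a minimal sketch of how the transition probabilities π(u, v) could be estimated from a DNA sequence by relative frequencies. The function name `estimate_transitions` and the toy sequence are assumptions made for this example, not part of the talk.

```python
from collections import Counter

ALPHABET = "acgt"

def estimate_transitions(seq):
    """Estimate pi(u, v) = P(X_k = v | X_{k-1} = u) by relative frequencies."""
    pair_counts = Counter(zip(seq, seq[1:]))  # N(uv) for every pair of neighbours
    pi = {}
    for u in ALPHABET:
        row_total = sum(pair_counts[(u, v)] for v in ALPHABET)  # N(u+)
        pi[u] = {v: (pair_counts[(u, v)] / row_total if row_total else 0.0)
                 for v in ALPHABET}
    return pi

# Toy sequence (hypothetical data)
pi = estimate_transitions("acgtacgtaacc")
print(pi["a"])
```

Each row of `pi` is a probability law on the alphabet whenever the letter u has at least one successor in the sequence.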

Page 3:

Why MM ? – 2

Example :

[Figure labels: E. coli, RecBCD, the bacterium's own genome, chi]

A complex, called RecBCD, protects the cell against viruses.

To avoid the destruction of the cell's own genome, a password, gctggtgg (called chi), occurs along the genome. When RecBCD bumps into a chi, it stops its destruction. In order to be efficient, the number of occurrences of chi is much higher than the number predicted by a Markov model.
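The comparison between an observed count and the count predicted by a Markov model can be sketched as follows. This assumes the usual plug-in estimate for an order-1 model, in which the expected count of w = w1...wl is the product of the counts N(w_i w_{i+1}) divided by the product of the counts N(w_i) of the inner letters. The function names and the toy sequence are hypothetical.

```python
def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    return sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))

def expected_count_m1(seq, word):
    """Plug-in estimate of the expected count of word under an order-1 Markov model."""
    expected = 1.0
    for i in range(len(word) - 1):
        expected *= count_overlapping(seq, word[i:i + 2])  # N(w_i w_{i+1})
    for i in range(1, len(word) - 1):
        n_letter = seq.count(word[i])                      # N(w_i), inner letters only
        if n_letter == 0:
            return 0.0
        expected /= n_letter
    return expected

# Toy sequence (hypothetical data) containing the chi word twice
toy = "acgtgctggtggttacgctggtggca"
chi = "gctggtgg"
print(count_overlapping(toy, chi), expected_count_m1(toy, chi))
```

A word is a candidate "exceptional motif" when its observed count is far from this expectation; judging how far is the purpose of the statistics discussed later in the talk.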

Page 4:

Results MM

Page 5:

Parsimonious Markov Models

When we model a sequence in order to find exceptional motifs or for annotation, we have to estimate the parameters of the model, and the more parameters we have, the worse the estimation.

In a Markov Model of order m, there are $4^m$ predictors (the m-words), hence $3 \times 4^m$ parameters

In the M2 model, there are 16 predictors and 48 parameters

In M5, there are 1024 predictors, 3072 parameters
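The arithmetic behind these counts can be checked with a few lines; the helper name `markov_model_size` is made up for this sketch.

```python
def markov_model_size(order, alphabet_size=4):
    """Return (number of predictors, number of free parameters) of a Markov model."""
    predictors = alphabet_size ** order             # one predictor per m-word
    parameters = (alphabet_size - 1) * predictors   # each law on the alphabet has 3 free values
    return predictors, parameters

for m in (2, 5):
    print(m, markov_model_size(m))
```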

Page 6:

PMM – 2

A first restriction consists in taking the past into account only as far as needed : we use a long past when the sequence shows that this is necessary, and a short past when the sequence allows the economy. These models are called

VLMC = Variable Length Markov Chains

In this VLMC, there are 12 predictors : aa, ca, ga, ta, ac, cc, gc, tc, g, at, tt, [gc]t

There are 36 parameters

Notation : [gc] denotes « g or c » ; [act] denotes « a or c or t »

Page 7:

PMM – 3

But it is not obvious that, for the prediction of $X_k$, $X_{k-p}$ becomes less and less informative as p increases.

As an example (*), let us consider the ‘jumper’ model

$P(X_t = v \mid \text{past}) = P(X_t = v \mid X_{t-2})$

(the dependence ‘jumps’ over $X_{t-1}$)

It corresponds to this tree (4 predictors, 12 parameters)

(*) this model is not as academic as it seems : for example, in a coding region (periodic model depending on the phase), the 2nd position in a codon strongly depends on the 2nd position in the previous codon (cf. hydrophobicity)

Page 8:

PMM – 4

In this PMM there are 8 predictors (24 parameters) :

a[ac], c[ac], g[ac], t[ac], g, at, [cg]t, tt

More general (?) example :

These models are called

PMM = Parsimonious Markov Models
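To make the role of the predictors concrete, here is a small sketch that finds which predictor of the 8-predictor PMM applies to a given past. It assumes that predictors are written in reading order (most recent letter last) and reuses the bracket notation as regular-expression character classes; the function name `find_predictor` is hypothetical.

```python
import re

# The 8 predictors of the example PMM, in the slide's bracket notation
PREDICTORS = ["a[ac]", "c[ac]", "g[ac]", "t[ac]", "g", "at", "[cg]t", "tt"]

def find_predictor(context):
    """Return the unique predictor matching the end of `context` (most recent letter last)."""
    matches = [p for p in PREDICTORS if re.search(p + "$", context)]
    if len(matches) != 1:
        raise ValueError(f"{context!r} matches {len(matches)} predictors")
    return matches[0]

n_parameters = 3 * len(PREDICTORS)  # 3 free probabilities per predictor

print(find_predictor("ga"), find_predictor("cg"), find_predictor("gt"), n_parameters)
```

Every two-letter past falls under exactly one predictor, which is what makes the list a valid partition of the possible pasts.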

Page 9:

PMM – 5

More precisely : in the tree of predictors (*), below any node, all the partitions of $\mathcal{A} = \{t, c, a, g\}$ may appear

(*) : the different predictors appear in this tree as the paths from the leaves to the root

Hence, there are 15 possibilities below each node.
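The figure of 15 is the number of partitions of a 4-element set. A short recursive enumeration, given here as an illustrative sketch with a made-up function name, confirms it.

```python
def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + partition

partitions = list(set_partitions(list("tcag")))
print(len(partitions))
for blocks in partitions:
    print(" | ".join("".join(block) for block in blocks))
```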

Page 10:

PMM – 6

A Parsimonious Markov Model (PMM) is defined by
• such a dependence tree
• for each leaf (= for each predictor) a law on $\mathcal{A}$

$P(X_t = u \mid X_{t-1} = [tc])$

$P(X_t = u \mid X_{t-1} = a, X_{t-2} = c)$

$P(X_t = u \mid X_{t-1} = a, X_{t-2} = [tag])$

$P(X_t = u \mid X_{t-1} = g)$


Page 11:

PMM – 7

We will only work with finite order PMM : the longest predictor contains, say, m letters (the depth of the tree is m)

Obviously a PMM of order m is a MM of order m

Note : the number of PMM increases very quickly with m : in the 4-letter alphabet and for m = 5 there are some $10^{85}$


Notations : $\tau$ denotes a tree of predictors

$W$ its set of predictors (the leaves of $\tau$)

For $w \in W$, $\pi_{w,u} = P(X_t = u \mid w)$


Page 12:

Statistics on PMM

For a fixed tree $\tau$, the likelihood is obviously

$L(\pi) = \ldots \prod_{w \in W} \prod_{u \in \mathcal{A}} \pi_{w,u}^{N(wu)}$

(The dots correspond to the first letters in the sequence. We will not care about them today.)

Which leads to the classical MLE

$\hat{\pi}_{w,u} = \dfrac{N(wu)}{N(w+)}$ (where $N(w+) = \sum_{v} N(wv)$)

The difficulty arises when we want to choose the tree : this is a model selection problem (among, for example, $10^{85}$ models)

Page 13:

Statistics – 2

Therefore we adopt a Bayesian approach

A priori law :
• on the tree, let us choose the uniform law (it can be changed)
• on the transition parameters, it is natural to choose a Dirichlet law, which is conjugate :

if, for $w \in W$, a priori $P(\pi_{w,\bullet}) \propto \prod_{u} \pi_{w,u}^{\alpha(w,u)}$

then, a posteriori $P(\pi_{w,\bullet} \mid X) \propto \prod_{u} \pi_{w,u}^{\alpha(w,u) + N(wu)}$

The MAP estimator of $\pi_{w,u}$ remains the same as before, except that $N(wu)$ has to be replaced by

$N'(wu) = N(wu) + \alpha(w, u)$

Page 14:

Statistics – 3

The use of Bayes formula then gives as a posterior law on the trees

$\ln P(\tau \mid X) = \sum_{w} S(w)$

where the sum is taken over all the predictors w in the tree and

$S(w) = \sum_{u} \ln \Gamma(N(wu)) - \ln \Gamma\left(\sum_{u} N(wu)\right)$

($\Gamma$ is such that $\Gamma(k+1) = k!$, $k \in \mathbb{N}$)

Writing the posterior law in this way shows that $P(\tau \mid X)$ may be maximized in a recursive way
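Because the score is a sum over predictors, two candidate trees can be compared predictor by predictor. The sketch below uses the standard Dirichlet-multinomial log marginal likelihood with a symmetric pseudo-count `alpha`; this is an assumption of the example and may differ from S(w) above by terms that depend on the prior. It compares two order-1 trees on hypothetical counts: one where the letters a and c are separate predictors, one where they are merged into [ac].

```python
from math import lgamma

ALPHABET = "acgt"

def log_marginal(counts, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood of one predictor's counts.

    counts[u] plays the role of N(wu); alpha is a hypothetical symmetric pseudo-count.
    """
    total = sum(counts.values())
    k = len(ALPHABET)
    score = lgamma(k * alpha) - lgamma(k * alpha + total)
    for u in ALPHABET:
        score += lgamma(alpha + counts.get(u, 0)) - lgamma(alpha)
    return score

def tree_score(blocks, pair_counts, alpha=1.0):
    """Sum of per-predictor scores for an order-1 tree whose leaves are `blocks`."""
    score = 0.0
    for block in blocks:
        # counts of a merged predictor are the sums of the counts of its letters
        merged = {u: sum(pair_counts.get(a + u, 0) for a in block) for u in ALPHABET}
        score += log_marginal(merged, alpha)
    return score

# Hypothetical pair counts N(uv): 'a' and 'c' are followed by the same law
pair_counts = {a + u: 10 for a in "ac" for u in ALPHABET}
pair_counts.update({"ga": 30, "gc": 2, "gg": 2, "gt": 2,
                    "ta": 2, "tc": 2, "tg": 2, "tt": 30})

separate = tree_score(["a", "c", "g", "t"], pair_counts)
merged = tree_score(["ac", "g", "t"], pair_counts)
print(separate, merged)
```

On these counts the merged tree gets the higher score, which is the parsimony effect the PMM is designed to exploit. A full search would apply the same comparison recursively from the leaves to the root.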

Page 15:

Application to real genomes

We fitted MM and PMM for the orders m = 3, 4 and 5
• on the set of the 224 complete bacterial genomes published to date
• on their coding regions (CDS)

To compare the adequacy of these models, we computed the BIC criterion for each model M

BIC(M) = 2 L(M) − nb_param(M) · ln n

(L(M) being the maximized log-likelihood of M and n the length of the sequence)

“The higher the BIC, the better the model”
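As a minimal sketch of the criterion, the code below computes BIC for plain Markov models of order 0 and 1 on a toy sequence. It assumes that L(M) is the maximized log-likelihood (first letters ignored) and that n is the sequence length; the function names and the sequence are hypothetical, and PMM fitting itself is not reproduced here.

```python
from collections import Counter
from math import log

def max_log_likelihood(seq, order):
    """Maximized log-likelihood of a Markov model of the given order (first letters ignored)."""
    word_counts = Counter(seq[i:i + order + 1] for i in range(len(seq) - order))
    context_counts = Counter()
    for word, n in word_counts.items():
        context_counts[word[:-1]] += n  # N(w+)
    return sum(n * log(n / context_counts[word[:-1]]) for word, n in word_counts.items())

def bic(seq, order, alphabet_size=4):
    """BIC(M) = 2 L(M) - nb_param(M) * ln n, with the sign convention 'higher is better'."""
    nb_param = (alphabet_size - 1) * alphabet_size ** order
    return 2 * max_log_likelihood(seq, order) - nb_param * log(len(seq))

seq = "acgt" * 50  # toy periodic sequence (hypothetical data)
print(bic(seq, 0), bic(seq, 1))
```

On this strongly structured toy sequence the order-1 model wins despite its 12 parameters, because its likelihood gain outweighs the penalty.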

Page 16:

Application – 2

This picture plots BIC(PMM) – BIC(MM) against the size of the bacterial genome. For all the bacteria, the PMM fits better than the classical MM.


Page 17:

Approach using FDA

Recent results (Gregory Nuel) concern the use of «Finite (Deterministic) Automata» (FDA) in the statistics of words or patterns

To a word, we may associate an FDA :

Example 1 : on {a,b}, w = aaab

States : [Figure: state diagram of the automaton, not recovered]






This can be generalized if “one” word w is replaced by a motif (finite family of words) or even a language.
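For a single word, one common construction (assumed here, since the slide's diagram is not available) takes as state the length of the longest prefix of w that is a suffix of the text read so far. The sketch below builds that automaton for w = aaab on {a,b} and uses it to count occurrences; the function names are hypothetical and this is not Nuel's program.

```python
def build_word_automaton(word, alphabet):
    """Deterministic automaton whose state is the length of the longest
    prefix of `word` that is a suffix of the text read so far."""
    transitions = []
    for state in range(len(word) + 1):
        row = {}
        for letter in alphabet:
            candidate = word[:state] + letter
            # longest suffix of `candidate` that is also a prefix of `word`
            k = min(len(word), len(candidate))
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            row[letter] = k
        transitions.append(row)
    return transitions

def count_occurrences(seq, word, alphabet):
    """Run `seq` through the automaton and count visits to the final state."""
    transitions = build_word_automaton(word, alphabet)
    state, count = 0, 0
    for letter in seq:
        state = transitions[state][letter]
        count += state == len(word)  # final state reached: one more occurrence
    return count

automaton = build_word_automaton("aaab", "ab")
for state, row in enumerate(automaton):
    print(state, row)
print(count_occurrences("aaabaaaab", "aaab", "ab"))
```

Each letter of the sequence costs a single table lookup, whatever the length of the word, which is where the efficiency of the automaton approach comes from.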

Page 18:

Approach using FDA

This automaton is specifically dedicated to the study of the word w (the motif, ...) : if we “run” a sequence on this graph, the automaton counts the occurrences of w (the motif, ...)

It turns out to be VERY efficient : “wordcount”, a program in EMBOSS, needs 4352 seconds to count the occurrences of all 12-words in E. coli ; Nuel’s program achieves this task in 9.86 seconds

The PROSITE motif [motif not recovered] (some $10^{12}$ words) is treated by an FDA of
• 329 (30) states in M0
• 1393 (78) states in M1

Page 19:

Approach using FDA

If this sequence X is a Markov chain, we then have another MC running on this graph.

Even for “rather complicated motifs”, this gives the law of “all” statistics of words :
- exact law of the first occurrence of a motif (taking into account the “starting point”),
- exact law of the number of occurrences of the motif,
- in particular expectation and variance of these laws, opening the possibility of Gaussian, Poisson, ... approximations (and an exhaustive study of the quality of these approximations),
- law of a motif M conditionally on the number of occurrences of another one, M’.
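The idea of a Markov chain running on the automaton can be sketched in the simplest setting, an i.i.d. sequence (order 0), which is an assumption made here to keep the example short. The distribution over automaton states is propagated letter by letter, with the final state made absorbing, which yields the exact probability that the word has occurred at least once within n letters. For an order-1 sequence the state would have to be paired with the last letter read. Function names are hypothetical.

```python
def longest_prefix_suffix(text, word):
    """Length of the longest prefix of `word` that is a suffix of `text`."""
    k = min(len(word), len(text))
    while k > 0 and text[-k:] != word[:k]:
        k -= 1
    return k

def prob_word_occurs(word, letter_probs, n):
    """P(word occurs at least once in n i.i.d. letters), via the chain on automaton states."""
    final = len(word)
    dist = [0.0] * (final + 1)
    dist[0] = 1.0                       # start in the empty-prefix state
    for _ in range(n):
        new = [0.0] * (final + 1)
        new[final] = dist[final]        # final state made absorbing
        for state in range(final):
            if dist[state] == 0.0:
                continue
            for letter, p in letter_probs.items():
                nxt = longest_prefix_suffix(word[:state] + letter, word)
                new[nxt] += dist[state] * p
        dist = new
    return dist[final]

probs = {"a": 0.5, "b": 0.5}
for n in (2, 3, 10):
    print(n, prob_word_occurs("ab", probs, n))
```

Differences between successive values of n give the law of the position of the first occurrence.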

Page 20:

2nd Part :

Hidden Markov Models

Page 21:

Hidden Markov Models

An important criticism of Markov modelling is its stationarity : a well-known theorem says that, under weak conditions,

$P(X_k = u) \to \mu(u)$ (when $k \to \infty$)

(and the rate of convergence is exponential.)
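The convergence can be observed numerically on a small example. The two-letter transition matrix below is hypothetical and chosen so that the stationary law is known in closed form: µ(a) = 0.5 / (0.1 + 0.5) = 5/6.

```python
def step(dist, pi):
    """One step of the chain: new(v) = sum over u of dist(u) * pi[u][v]."""
    return {v: sum(dist[u] * pi[u][v] for u in dist) for v in dist}

pi = {  # hypothetical transition matrix on a 2-letter alphabet
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.5, "b": 0.5},
}

dist = {"a": 0.0, "b": 1.0}  # start from the letter b with probability 1
errors = []
for k in range(20):
    dist = step(dist, pi)
    errors.append(abs(dist["a"] - 5 / 6))

print(dist)
print(errors[:5])
```

In this example the error is multiplied by 0.4 (the second eigenvalue of the matrix) at each step, which illustrates the exponential rate.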

But biological sequences are not homogeneous.

There are g+c rich segments / g+c poor segments (isochores).

One may presume (and verify) that the rules of succession of letters differ in coding parts / non-coding parts.

Is it possible to take advantage of this problem and to develop a tool for the analysis of heterogeneity ? => annotation

Page 22:

HMM – 2

Suppose that d states alternate along the sequence

And in each state we have a MC :

if $S_k = 1$, then $P(X_k = v \mid X_{k-1} = u) = \pi_1(u ; v)$

if $S_k = 2$, then $P(X_k = v \mid X_{k-1} = u) = \pi_2(u ; v)$

and (more technical than biological - see HSMM)

$P(S_k = y \mid S_{k-1} = x) = \pi_0(x ; y)$

[Figure: the sequence cut into successive segments with $S_k$ = 1, 2, 1, 2, 1]

Our objectives
• Estimate the parameters $\pi_1, \pi_2, \pi_0$
• Allocate a state {1, 2} to each position

Page 23:

HMM – 3

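Use the likelihood !!

$L(\theta) = \sum_{S_1 S_2 \ldots S_n} \mu_0(S_1)\, \mu_{S_1}(X_1) \prod_{k=2}^{n} \pi_0(S_{k-1}, S_k)\, \pi_{S_k}(X_{k-1}, X_k)$

(θ denotes the set of parameters ; each term of the sum is a product of n terms, n being the length of the sequence)

The sum runs over all possibilities $S_1 S_2 \ldots S_n$ ; there are $s^n$ terms (s = number of states)

$2^{10\,000} \approx 10^{3\,000}$ Despair !!!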
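The standard way out of this explosion, which is not shown on the recovered slides, is the forward recursion: it computes the same likelihood with a number of operations proportional to n s² instead of s^n terms. The sketch below assumes a 2-state model with Markov emissions as defined above, hypothetical parameter values on a 2-letter alphabet, and checks the recursion against the brute-force sum on a short sequence.

```python
from itertools import product

def likelihood_forward(x, mu0, mu, pi0, pi):
    """Likelihood by the forward recursion, in O(n * s^2) operations."""
    states = range(len(mu0))
    alpha = [mu0[s] * mu[s][x[0]] for s in states]
    for prev, cur in zip(x, x[1:]):
        alpha = [sum(alpha[r] * pi0[r][s] for r in states) * pi[s][prev][cur]
                 for s in states]
    return sum(alpha)

def likelihood_brute_force(x, mu0, mu, pi0, pi):
    """Same likelihood as an explicit sum over all s^n hidden paths."""
    states = range(len(mu0))
    total = 0.0
    for path in product(states, repeat=len(x)):
        p = mu0[path[0]] * mu[path[0]][x[0]]
        for k in range(1, len(x)):
            p *= pi0[path[k - 1]][path[k]] * pi[path[k]][x[k - 1]][x[k]]
        total += p
    return total

# Hypothetical parameters: 2 hidden states, alphabet {a, b}
mu0 = [0.5, 0.5]                                   # law of S_1
mu = [{"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}]  # law of X_1 in each state
pi0 = [[0.9, 0.1], [0.2, 0.8]]                     # hidden-state transitions
pi = [
    {"a": {"a": 0.8, "b": 0.2}, "b": {"a": 0.6, "b": 0.4}},  # emissions in state 0
    {"a": {"a": 0.3, "b": 0.7}, "b": {"a": 0.1, "b": 0.9}},  # emissions in state 1
]

x = "aabbbab"
print(likelihood_forward(x, mu0, mu, pi0, pi))
print(likelihood_brute_force(x, mu0, mu, pi0, pi))
```

For sequences of realistic length the forward quantities underflow, so an implementation would rescale them at each step or work with logarithms.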

Page 24:


Page 25:

H.M.M. continue

Page 26:

Searching nucleosome positions


In eukaryotes (only), an important part of the chromosomes forms chromatin, a state where the double helix winds round “beads”, forming a necklace :

Each bead is called a nucleosome. Its core is a complex of 8 proteins (an octamer) called histones (H2A, H2B, H3, H4). DNA winds twice around this core and is locked by another histone (H1). The total weight of the histones is roughly equal to the weight of the DNA.

[Scale bar: 10 nm]

Page 27:

Curvature within curvature


The DNA helix turns twice around the histone core. Each turn corresponds to about 7 pitches of the helix, each one made with about 10 nucleotides.

Total = 146 nt within each nucleosome.

Depending on the position (“in” vs “out”), the curvature satisfies different constraints

Page 28:

Nuc and “no-nuc” states

[Figure: chain of states 1, 2, ..., 70 labelled “nucleosome core”]

Trifonov (99) as well as Rando (05) underline that there are ‘no’ nucleosomes in gene promoters (accessibility). They introduce, “before” the nucleosome, a “no-nucleosome” state.

Ioshikhes, Trifonov, Zhang, Proc. Natl Acad. Sci. 96 (1999)

Yuan, Liu, Dion, Slack, Wu, Altschuler, Rando, Sciencexpress (2005)

Page 29:


Following an idea (Baldi, Lavery, ...) we introduce an index of bendability ; it depends on successions of 2, 3, 4, ... nucleotides.










Page 30:

PNUC table

There exist various tables which indicate the bendability of di-, tri- or even tetra-nucleotides (PNUC, DNase, ...)

We used PNUC-3 (*) :

1st letter |        2nd letter         | 3rd letter
           |    a     t     g     c    |
     a     |   0.0   2.8   3.3   5.2   |     a
     t     |   7.3   7.3  10.0  10.0   |     a
     g     |   3.0   6.4   6.2   7.5   |     a
     c     |   3.3   2.2   8.3   5.4   |     a
     a     |   0.7   0.7   5.8   5.8   |     t
     t     |   2.8   0.0   5.2   3.3   |     t
     g     |   5.3   3.7   5.4   7.5   |     t
     c     |   6.7   5.2   5.4   5.4   |     t
     a     |   5.2   6.7   5.4   5.4   |     g
     t     |   2.2   3.3   5.4   8.3   |     g
     g     |   5.4   6.5   6.0   7.5   |     g
     c     |   4.2   4.2   4.7   4.7   |     g
     a     |   3.7   5.3   7.5   5.4   |     c
     t     |   6.4   3.0   7.5   6.2   |     c
     g     |   5.6   5.6   8.2   8.2   |     c
     c     |   6.5   5.4   7.5   6.0   |     c

Examples : PNUC(cga) = 8.3 ; PNUC(tcg) = 8.3

(*) Goodsell, Dickerson, NAR 22 (1994)
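A bendability profile along a sequence can be sketched from the trinucleotide table. The values below are those read from the slide's table, stored once per pair {trinucleotide, reverse complement} since the table gives both the same value (for instance cga and tcg). The sliding-window average and the function names are assumptions of this example, not the author's exact index.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}

def reverse_complement(word):
    return "".join(COMPLEMENT[b] for b in reversed(word))

# One value per pair {trinucleotide, reverse complement}, as read from the table
HALF_TABLE = {
    "aaa": 0.0, "aat": 0.7, "aag": 5.2, "aac": 3.7,
    "ata": 2.8, "atg": 6.7, "atc": 5.3, "aga": 3.3,
    "agt": 5.8, "agg": 5.4, "agc": 7.5, "aca": 5.2,
    "acg": 5.4, "acc": 5.4, "taa": 7.3, "tga": 10.0,
    "cta": 2.2, "gta": 6.4, "caa": 3.3, "gaa": 3.0,
    "cca": 5.4, "gca": 7.5, "cga": 8.3, "gga": 6.2,
    "gag": 5.4, "gtg": 6.5, "ggg": 6.0, "gcg": 7.5,
    "cag": 4.2, "cgg": 4.7, "gac": 5.6, "ggc": 8.2,
}
PNUC = dict(HALF_TABLE)
for tri, value in HALF_TABLE.items():
    PNUC[reverse_complement(tri)] = value  # complete the 64 entries by symmetry

def bendability_profile(seq, window=3):
    """Mean PNUC value over `window` consecutive trinucleotides, at each position."""
    scores = [PNUC[seq[i:i + 3]] for i in range(len(seq) - 2)]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

print(PNUC["cga"], PNUC["tcg"])
print(bendability_profile("aaaatcgatcgaaaa"))
```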

Page 31:

Scan of K3 of yeast

Sometimes it works :


Page 32:

What about positions ?

We represent (*) parts of the chromosome K3 of Yeast


The green curve (“proba” of the no-nuc state) increases between genes (promoters)

The red curve (“proba” of the nucleosome state) appears periodically in genes.

(*) using the software MuGeN, by Mark Hoebeke

Page 33:

Acknowledgements

Labo «Statistique et Génome» : Christophe AMBROISE, Maurice BAUDRY, Etienne BIRMELE, Cécile COT, Emmanuelle DELLA-CHIESA, Mark HOEBEKE, Mickael GUEDJ, François KÉPÈS, Sophie LEBRE, Catherine MATIAS, Vincent MIELE, Florence MURI-MAJOUBE, Grégory NUEL, Franck PICARD, Hugues RICHARD, Anne-Sophie TOCQUET, Nicolas VERGNE

Labo MIG – INRA : Philippe BESSIÈRE, François RODOLPHE, Sophie SCHBATH, Élisabeth de TURCKHEIM

Labo AGRO : Jean-Noël BACRO, Jean-Jacques DAUDIN, Stéphane ROBIN

Lab’ Rouen : Dominique CELLIER, Sabine MERCIER

Sec : Michèle ILBERT