
Poster PDF: t-takeda.com/downloads/ibis2016_poster.pdf

Model Selection on Pairwise Hidden Markov Models using Factorized Information Criterion

Taikai Takeda (1), Michiaki Hamada (1, 2)
(1) Waseda University, (2) AIST-Waseda CBBD OIL

55N-06-10, 3-4-1, Okubo Shinjuku-ku, Tokyo 169-8555, Japan. E-mail: [email protected] © 2016 T. Takeda et al.

Abstract

Although biological sequence alignment has been studied extensively, the model selection problem has remained untouched. We assume that model selection lets us extract biological knowledge by interpreting the selected models, e.g. by comparing the models selected for pairwise DNA sequence alignments of different pairs of species. As a model selection method, we introduce Factorized Asymptotic Bayesian Pairwise Hidden Markov Models (FAB-PHMM), based on an asymptotically consistent information criterion for the model evidence. We conducted an experiment on a synthetic dataset to illustrate the model selection capability of the proposed method. In an experiment on real DNA sequence data, we observed that much more complex models are selected than the models used previously, and that the result is consistent with evolutionary distance.

1) Biological function annotation

Different DNA regions have different biological functions. Assuming that each region is generated from a different probability distribution corresponding to a specific hidden state, we can annotate regions with function types by inspecting the posterior over hidden states.

[Figure: DNA sequences X and Y with regions labeled function A, gap, function B, function C]

[Figure: example pairwise alignment with Y-insertion, substitution, and X-insertion columns marked]

sequence X: C--TAGGGGATCGAATCAAATACG

sequence Y: CATTAGGAAATCGAATC----ACG

Pairwise sequence alignment

Pairwise sequence alignment aims to assess the similarity of two sequences by introducing gaps. A gap is represented as “-”, and a gap column has a nucleotide on only one side. In score-based alignment [Smith+1981], the similarity score is the sum of column scores: match columns contribute positively, and insertion and substitution columns contribute negatively. The optimal alignment is found by dynamic programming, which seeks the gap placement that maximizes the similarity score. The similarity of the sequence pair is then assessed by the score of the optimal alignment.
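The dynamic programming step can be sketched as follows. This is a minimal illustration of a Smith-Waterman-style local alignment score with a linear gap penalty, not the poster's implementation; the function name and the score values (`match=2`, `mismatch=-1`, `gap=-2`) are assumptions chosen for the example.

```python
def local_alignment_score(x, y, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of x and y (linear gap penalty).

    The scoring values are illustrative assumptions, not taken from the poster.
    """
    rows, cols = len(x) + 1, len(y) + 1
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # x[i-1] aligned to a gap in y
            left = H[i][j - 1] + gap    # y[j-1] aligned to a gap in x
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACGT", "ACGT"))
    print(local_alignment_score("ACGT", "AGT"))
```

Match columns add to the score while substitution and gap columns subtract from it, so the maximum over the table plays the role of the similarity score described above.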

Pairwise Hidden Markov Models (PHMM)

Pairwise Hidden Markov Models are probabilistic models of pairwise sequence alignment [Durbin+, 1999]. With this model, an alignment can be obtained via MAP decoding, and the parameters can be learned with the EM algorithm.
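To make the probabilistic view concrete, here is a minimal sketch of the forward algorithm for a pair HMM with exactly one state per type (M, X, Y) and no explicit end state. It sums the probability of every alignment of the two sequences. All names (`pair_hmm_likelihood`, `pi`, `A`, `e_match`, `e_ins`) and all parameter values are assumptions made for illustration; the models in this poster allow several hidden states per type.

```python
import numpy as np

ALPHABET = "ACGT"
# How far each state advances in (x, y): M emits a pair, X emits only an
# x symbol (X-insertion column), Y emits only a y symbol (Y-insertion column).
STEPS = [(1, 1), (1, 0), (0, 1)]


def pair_hmm_likelihood(x, y, pi, A, e_match, e_ins):
    """Sum of the probabilities of all alignments of x and y.

    pi: initial state probabilities, shape (3,)
    A: transition matrix, A[l, k] = p(next state k | current state l)
    e_match: pair emission probabilities for state M, shape (4, 4)
    e_ins: single-symbol emission probabilities for X and Y, shape (4,)
    """
    xi = [ALPHABET.index(c) for c in x]
    yi = [ALPHABET.index(c) for c in y]
    n, m = len(xi), len(yi)
    # f[k, i, j] = p(x[:i], y[:j], last state = k)
    f = np.zeros((3, n + 1, m + 1))
    for i in range(n + 1):
        for j in range(m + 1):
            for k, (di, dj) in enumerate(STEPS):
                a, b = i - di, j - dj
                if a < 0 or b < 0:
                    continue
                # The first column starts from the initial distribution.
                incoming = pi[k] if (a, b) == (0, 0) else f[:, a, b] @ A[:, k]
                if k == 0:
                    emit = e_match[xi[i - 1], yi[j - 1]]
                elif k == 1:
                    emit = e_ins[xi[i - 1]]
                else:
                    emit = e_ins[yi[j - 1]]
                f[k, i, j] = incoming * emit
    return float(f[:, n, m].sum())


# Illustrative parameters (assumed, not estimated from data).
pi = np.array([0.8, 0.1, 0.1])
A = np.array([[0.8, 0.1, 0.1],
              [0.6, 0.3, 0.1],
              [0.6, 0.1, 0.3]])
e_match = np.full((4, 4), 0.02)
np.fill_diagonal(e_match, 0.19)
e_ins = np.full(4, 0.25)
```

For `x = "A"` and `y = "A"` only three alignments exist (a single match column, X then Y, or Y then X), so the result can be checked by hand: 0.8 × 0.19 + 2 × (0.1 × 0.25 × 0.1 × 0.25) ≈ 0.15325. Replacing the sum over incoming states with a maximum and keeping back-pointers would give the most probable alignment instead.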

[Figure: transition diagram of hidden states {M, X, Y}; labels: Y-insertion columns, X-insertion columns, other columns]

2) Evolutionary distance

We assume that the selected model reflects evolutionary distance: the model selected for an evolutionarily more distant pair of species should be more complex. The selected model might also summarize the evolutionary behavior of the pair.

Question: Which is the “best” PHMM model?

A PHMM model is parameterized by the number of hidden states of each type, $M = (K_M, K_X, K_Y)$. Our goal is to select the model $M$ that maximizes the marginal likelihood of the observed data.

Factorized Asymptotic Bayesian Pairwise Hidden Markov Models (FAB-PHMM)

Following FAB-HMM [Fujimaki+2012], we apply the Factorized Information Criterion (FIC) to PHMMs. Our goal is to maximize FIC, which is an asymptotically consistent approximation of the log marginal likelihood.

Background

Problem Setting

Motivations

Methods

Formulation of HMM and PHMM

[Figure: HMM and PHMM diagrams, with the example alignment of --GG and AAGG]

Maximization of marginal likelihood

Variational Lower Bound

We introduce a variational distribution $q(\mathbf{z})$ to obtain a lower bound on the marginal likelihood.
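For reference, the standard form of this bound follows from Jensen's inequality applied to the logarithm. The block below is a generic sketch in the notation of this poster, not a reproduction of the authors' derivation:

```latex
\log p(\mathbf{x}, \mathbf{y} \mid M)
  = \log \sum_{\mathbf{z}} q(\mathbf{z})\,
      \frac{p(\mathbf{x}, \mathbf{y}, \mathbf{z} \mid M)}{q(\mathbf{z})}
  \;\ge\; \sum_{\mathbf{z}} q(\mathbf{z})
      \log \frac{p(\mathbf{x}, \mathbf{y}, \mathbf{z} \mid M)}{q(\mathbf{z})}
```

Equality holds when $q(\mathbf{z})$ equals the posterior $p(\mathbf{z} \mid \mathbf{x}, \mathbf{y}, M)$, which is why maximizing the right-hand side over $q$ recovers the log marginal likelihood.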

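Factorized Information Criterion (FIC)

$$
\mathrm{FIC}(\mathbf{x}, \mathbf{y} \mid M) = \max_{q} \sum_{\mathbf{z}} q(\mathbf{z}) \Bigl[ \log p(\mathbf{x}, \mathbf{y}, \mathbf{z} \mid \bar{\theta}) - \frac{D_{\alpha}}{2} \log N - \sum_{k=1}^{K} \frac{D_{\beta_k}}{2} \log\Bigl( \sum_{n,t,u} z^{n}_{tuk} \Bigr) - \sum_{k=1}^{K} \frac{D_{\phi_k}}{2} \log\Bigl( \sum_{n,t,u} z^{n}_{tuk} \Bigr) - \log q(\mathbf{z}) \Bigr]
$$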

FIC is obtained by applying the Laplace approximation to each term and ignoring terms that are asymptotically small with respect to $N$.

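FIC Lower Bound

We optimize the FIC lower bound (FICLB) through an EM-like iteration. The lower bound is obtained by using 1) the linear approximation of the logarithm, $\log a \le \log b + (a - b)/b$, and 2) the optimality of the ML estimator, $\log p(\mathbf{x}, \mathbf{y}, \mathbf{z} \mid \bar{\theta}) \ge \log p(\mathbf{x}, \mathbf{y}, \mathbf{z} \mid \theta)$.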
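As a sketch of how the tangent-line bound on the logarithm acts on one penalty term, take $a = \sum_{n,t,u} z^{n}_{tuk}$ and $b = \sum_{n,t,u} \tilde{q}(z^{n}_{tuk})$, and let $D$ stand for either $D_{\beta_k}$ or $D_{\phi_k}$. The symbols $a$, $b$, and $D$ are shorthand introduced here for illustration:

```latex
\log a \;\le\; \log b + \frac{a - b}{b} \;=\; \log b - 1 + \frac{a}{b}
\quad\Longrightarrow\quad
-\frac{D}{2} \log a \;\ge\; -\frac{D}{2} \bigl( \log b - 1 \bigr) - \frac{D}{2b}\, a
```

The last term is linear in the indicators $z^{n}_{tuk}$, so it can be absorbed into the expectation under $q$ as a per-state factor of the form $\exp(-D / (2b))$, while the first term depends only on $\tilde{q}$.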

Heuristic model Pruning

Experiments

Model selection on synthetic data

Size      M* = (K_M*, K_X*, K_Y*)
small     (1,1,1)
medium    (1,2,2)
large     (3,2,2)

Real DNA sequence

[Figure: pairwise sequence comparison among species; model selection on species A vs B and on species A vs C yields models X and Y]

The difference between models X and Y might reflect the difference between species B and C.

[Figure: models selected for Human vs Orangutan and for Human vs Mouse]

One additional hidden state -> longer insertions of human-side nucleotides

D1-32

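HMM:

$$
\log p(\mathbf{x}, \mathbf{z} \mid \theta) = \sum_{n=1}^{N} \Bigl[ \log p(z^{n}_{1} \mid \alpha) + \sum_{t=1}^{T} \Bigl( \log p(z^{n}_{t} \mid z^{n}_{t-1}, \beta) + \log p(x^{n}_{t} \mid z^{n}_{t}, \phi) \Bigr) \Bigr]
$$

PHMM:

$$
\log p(\mathbf{x}, \mathbf{y}, \mathbf{z} \mid \theta) = \sum_{n=1}^{N} \Bigl[ \log p(z^{n}_{\mathrm{in}} \mid \alpha) + \sum_{t=0}^{T_X} \sum_{u=0}^{T_Y} \Bigl( \log p(z^{n}_{tu} \mid \mathrm{pa}(z^{n}_{tu}), \beta) + \log p(x^{n}_{t}, y^{n}_{u} \mid z^{n}_{tu}, \phi) \Bigr) \Bigr]
$$

Marginal likelihood (with the parameters integrated out):

$$
p(\mathbf{x}, \mathbf{y}, \mathbf{z} \mid M) = \int \prod_{n=1}^{N} p(z^{n}_{\mathrm{in}} \mid \alpha) \prod_{t=0}^{T_X} \prod_{u=0}^{T_Y} \prod_{k=1}^{K} p_k(z^{n}_{tu} \mid \beta_k)^{\mathrm{pa}(z^{n}_{tu})_k} \, p(x^{n}_{t}, y^{n}_{u} \mid z^{n}_{tu}, \phi_k)^{z^{n}_{tuk}} \, p(\theta \mid M) \, d\theta
$$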

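$$
\mathrm{FIC} \ge \mathrm{FIC}_{\mathrm{LB}}(\mathbf{x}, \mathbf{y}, q, \tilde{q}, \theta, M) = \sum_{n} \sum_{\mathbf{z}^{n}} q(\mathbf{z}^{n}) \Bigl[ \log p(\mathbf{x}^{n}, \mathbf{y}^{n}, \mathbf{z}^{n} \mid \theta) + \sum_{t,u,k} z^{n}_{tuk} \log \delta_{tuk} - \log q(\mathbf{z}^{n}) \Bigr] - \frac{D_{\alpha}}{2} \log N - \sum_{k=1}^{K} \frac{D_{\beta_k}}{2} \Bigl( \log \sum_{n,t,u \setminus (*, T^{n}_{X}, T^{n}_{Y})} \tilde{q}(z^{n}_{tuk}) - 1 \Bigr) - \sum_{k=1}^{K} \frac{D_{\phi_k}}{2} \Bigl( \log \sum_{n,t,u} \tilde{q}(z^{n}_{tuk}) - 1 \Bigr)
$$

$$
\delta_{tuk} =
\begin{cases}
\exp\Bigl( -\dfrac{D_{\phi_k}}{2 \sum_{n,t,u} \tilde{q}(z^{n}_{tuk})} \Bigr) & \text{if } t = T_X \text{ and } u = T_Y \\[2ex]
\exp\Bigl( -\dfrac{D_{\phi_k}}{2 \sum_{n,t,u} \tilde{q}(z^{n}_{tuk})} - \dfrac{D_{\beta_k}}{2 \sum_{n,t,u \setminus (*, T^{n}_{X}, T^{n}_{Y})} \tilde{q}(z^{n}_{tuk})} \Bigr) & \text{otherwise}
\end{cases}
$$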