Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09


Page 1: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Statistical Alignment and Footprinting
Rutgers – DIMACS 27.4.09

The Problem
• Statistical Alignment - Annotation - Annotation & Statistical Alignment

Statistical Alignment
• The Model
• The Pairwise Algorithm – the HMM connection
• Multiple sequence alignment algorithms

Annotation

Ahead

Annotation & Alignment
• The general problem
• protein secondary structure – protein genes – RNA structure - signal
• The general algorithm
• Signals (footprinting)
• Protein Secondary Structure Prediction
• Transcription Factor Prediction - Knowledge transfer - homologous/nonhomologous analysis

Page 2: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Sequence Evolution and Annotation
Alignment and Footprinting

[Figure: unobservable evolutionary history and structure above the observable sequences, illustrated with small example alignments (ACGTC / ACG-C, AGGCC / AGGCT, AGGCT / AGG-T) and a base-paired RNA fragment.]

References shown: Goldman, Thorne & Jones, 96; Knudsen.., 99; Eddy & co.; Meyer and Durbin 02; Pedersen …, 03; Siepel & Haussler 03

Footprinting - Signals (Blanchette)

Page 3: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

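Thorne-Kishino-Felsenstein (1991) Process

λ (birth rate), μ (death rate)

[Figure: a sequence evolving from T = 0 to T = t; # marks a residue, - a gap and * the immortal link at the left end of the sequence.]

Equilibrium probability of a sequence s:

$P(s) = (1 - \lambda/\mu)\,(\lambda/\mu)^{l}\,\pi_A^{\#A} \cdots \pi_T^{\#T}$, with $l = \mathrm{length}(s)$ and $\#A$ the number of A's in s.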
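The equilibrium formula on this slide can be evaluated directly. The following is a minimal sketch, assuming λ < μ and a dictionary `pi` of equilibrium residue frequencies; the function name and its arguments are illustrative and do not come from the talk.

```python
import math
from collections import Counter


def tkf91_equilibrium_log_prob(seq, lam, mu, pi):
    """Log of P(s) = (1 - lam/mu) * (lam/mu)**len(s) * prod(pi[x] ** count(x)).

    lam and mu are the birth and death rates (lam < mu is required for the
    geometric length distribution to exist); pi maps residues to their
    equilibrium frequencies.
    """
    if not 0 < lam < mu:
        raise ValueError("need 0 < lam < mu")
    ratio = lam / mu
    log_p = math.log(1 - ratio) + len(seq) * math.log(ratio)
    for residue, count in Counter(seq).items():
        log_p += count * math.log(pi[residue])
    return log_p
```

Working in log space avoids underflow for long sequences; summing `exp` of the length term over all lengths gives 1, as expected for a geometric distribution.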

Page 4: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

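λ & μ into Alignment Blocks

A. Amino Acids Ignored:

# - - -
# # # #      (k)    $p_k(t) = e^{-\mu t}\,[1-\lambda\beta]\,(\lambda\beta)^{k-1}$

# - - - -
- # # # #    (k)    $p'_k(t) = [1-e^{-\mu t}-\mu\beta]\,[1-\lambda\beta]\,(\lambda\beta)^{k-1}$,  $p'_0(t) = \mu\beta(t)$

* - - - -
* # # # #    (k)    $p''_k(t) = [1-\lambda\beta]\,(\lambda\beta)^{k}$

where $\beta = [1-e^{(\lambda-\mu)t}] / [\mu-\lambda e^{(\lambda-\mu)t}]$

B. Amino Acids Considered:

T - - -
R Q S W      (4)    $P_t(T \to R)\,\pi_Q \cdots \pi_W\,p_4(t)$

T - - - -
- R Q S W    (4)    $\pi_R\,\pi_Q \cdots \pi_W\,p'_4(t)$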

Page 5: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

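Basic Pairwise Recursion (O(length^3))

[Figure: dynamic programming matrix with s1 indexed by i and s2 by j. The entry P(s1[1:i] → s2[1:j]) at cell (i,j) is computed from the cells (i-1,j), (i-1,j-1), …, (i-1,j-k), …, with a "survive" and a "death" case for residue i.]

Initial condition: p'' applied to s2[1:j]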
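The recursion on this slide can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions, not the author's program: it uses the block probabilities p, p' and p'' as published for TKF91, assumes λ < μ, and takes the equilibrium frequencies `pi` and the substitution probability `f(a, b)` as caller-supplied inputs. All function and parameter names are hypothetical.

```python
import math


def tkf91_block_probs(lam, mu, t):
    """Return the TKF91 block probabilities p, p' and p'' as functions of k."""
    e = math.exp((lam - mu) * t)
    beta = (1 - e) / (mu - lam * e)
    lb = lam * beta
    survive = math.exp(-mu * t)

    def p(k):  # residue survives; k descendants including itself (k >= 1)
        return survive * (1 - lb) * lb ** (k - 1)

    def p_dead(k):  # residue dies, leaving k descendants (k >= 0)
        if k == 0:
            return mu * beta
        return (1 - survive - mu * beta) * (1 - lb) * lb ** (k - 1)

    def p_link(k):  # immortal link with k newborn residues (k >= 0)
        return (1 - lb) * lb ** k

    return p, p_dead, p_link


def tkf91_pairwise_probability(s1, s2, lam, mu, t, pi, f):
    """P(s1 evolves into s2 in time t), summed over all alignments.

    pi[x] is the equilibrium frequency of residue x and f(a, b) the
    probability that a is substituted by b in time t. Runs in O(len^3).
    """
    p, p_dead, p_link = tkf91_block_probs(lam, mu, t)
    n, m = len(s1), len(s2)
    P = [[0.0] * (m + 1) for _ in range(n + 1)]
    inserted = 1.0
    for j in range(m + 1):  # initial condition: everything born from the link
        if j > 0:
            inserted *= pi[s2[j - 1]]
        P[0][j] = p_link(j) * inserted
    for i in range(1, n + 1):
        for j in range(m + 1):
            total = P[i - 1][j] * p_dead(0)  # residue i dies without descendants
            tail = 1.0  # product of pi over the block, excluding its first residue
            for k in range(1, j + 1):
                first = s2[j - k]  # first residue of the block of size k
                survive_case = p(k) * f(s1[i - 1], first) * tail
                death_case = p_dead(k) * pi[first] * tail
                total += P[i - 1][j - k] * (survive_case + death_case)
                tail *= pi[first]
            P[i][j] = total
    return P[n][m]
```

With a one-letter alphabet every length has exactly one sequence, so summing the result over all lengths of s2 should give 1, which is a convenient sanity check on the recursion.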

Page 6: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

α-globin (141) and β-globin (146)
(From Hein, Wiuf, Knudsen, Moeller & Wiebling 2000)

430.108 : -log(α-globin)
327.320 : -log(α-globin --> β-globin)
747.428 : -log(α-globin, β-globin) = -log(l(sumalign))

λ*t: 0.0371805 +/- 0.0135899
μ*t: 0.0374396 +/- 0.0136846
s*t: 0.91701 +/- 0.119556

Maximum contributing alignment:

V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS

NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

Ratio l(maxalign)/l(sumalign) = 0.00565064

α-globin:
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

β-globin:
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

Page 7: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

An HMM Generating Alignments

Statistical Alignment
Steel and Hein, 2001 + Holmes and Bruno, 2001

[Table: transition probabilities of the pair HMM between the states -/# (insertion), #/# (match), #/- (deletion), the start * and the end E; the entries are functions of λ, μ, β and e^{-μt}.]

Emit functions:
e(##) = π(N1) f(N1,N2)
e(#-) = π(N1)
e(-#) = π(N2)

π(N1) - equilibrium prob. of N
f(N1,N2) - prob. that N1 evolves into N2

Page 8: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Why multiple statistical alignment is non-trivial
Steel & Hein, 2001; Hein, 2001; Holmes and Bruno, 2001

[Figure: three sequences s1, s2 and s3 related by a star tree, with a numbered list of HMM states 0 to 5, each defined by a column pattern of # and -.]

• An HMM generating alignment according to TKF91:

Page 9: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Human alpha hemoglobin; Human beta hemoglobin; Human myoglobin; Bean leghemoglobin

Probability of data: e^{-1560.138}

Probability of data and alignment: e^{-1593.223}

Probability of alignment given data: 4.279 * 10^{-15} = e^{-33.085}

Ratio of insertion-deletions to substitutions: 0.0334

Maximum likelihood phylogeny and alignment

Gerton Lunter, Istvan Miklos, Alexei Drummond, Yun Song
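The probability of the alignment given the data on this slide is the ratio of the two probabilities listed above it, so in log space it is a simple difference. A small sketch of that arithmetic follows; the function name is illustrative only.

```python
import math


def alignment_log_posterior(log_p_data_and_alignment, log_p_data):
    """log P(alignment | data) = log P(data, alignment) - log P(data)."""
    return log_p_data_and_alignment - log_p_data


# Values quoted on the slide (natural logarithms).
log_post = alignment_log_posterior(-1593.223, -1560.138)
posterior = math.exp(log_post)  # about 4.279e-15
```

The tiny value illustrates why fixing on a single alignment discards almost all of the probability mass.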

Page 10: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Metropolis-Hastings Statistical Alignment
Lunter, Drummond, Miklos, Jensen & Hein, 2005

The alignment moves:

We choose a random window in the current alignment

ALITL---GGALLTLTTLGG---TLTSLGAALLGLTSLGA

TNQHVSCTGNGN-HVSCTGKTNQH-SCTLNTNQHVSCTLN

QST--QCC-SS------CCS---QST--QC---QST--QC

Then delete all gaps so we get back subsequences

ALITL---GGALLTLTTLGG---TLTSLGAALLGLTSLGA

TNQHVSCTGNGN-HVSCTGKTNQH-SCTLNTNQHVSCTLN

QSTQCCSSCCSQSTQCQSTQC

Stochastically realign this part

ALITL---GGALLTLTTLGG---TLTSLGAALLGLTSLGA

TNQHVSCTGNGN-HVSCTGKTNQH-SCTLNTNQHVSCTLN

QSTQCCS-S--CCSQSTQC--QSTQC--

The phylogeny moves:

As in Drummond et al. 2002
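The first two steps of the alignment move (pick a window, delete its gaps) and the generic Metropolis-Hastings acceptance test can be sketched as below. This is only an outline under assumptions: the stochastic realignment proposal itself is not shown, and the function names, the `max_width` parameter and the row-of-strings alignment representation are hypothetical, not taken from the published sampler.

```python
import math
import random


def choose_window(n_columns, max_width, rng):
    """Pick a random half-open column window [start, end) in the alignment."""
    width = rng.randint(1, min(max_width, n_columns))
    start = rng.randint(0, n_columns - width)
    return start, start + width


def strip_window_gaps(alignment, start, end, gap="-"):
    """For each row return (left flank, ungapped window residues, right flank)."""
    return [(row[:start], row[start:end].replace(gap, ""), row[end:])
            for row in alignment]


def mh_accept(log_p_new, log_p_old, log_q_new_given_old, log_q_old_given_new, rng):
    """Standard Metropolis-Hastings acceptance test in log space."""
    log_ratio = (log_p_new - log_p_old) + (log_q_old_given_new - log_q_new_given_old)
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)
```

A realignment proposal would take the ungapped window residues, sample a new sub-alignment, and pass the target and proposal log-probabilities to `mh_accept`.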

Page 11: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Metropolis-Hastings Statistical Alignment
Lunter, Drummond, Miklos, Jensen & Hein, 2005

Page 12: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

How to proceed to many, many sequences?

• Dynamic programming stops at 4-5 sequences

• MCMC stops at 10-13ish sequences

• Some approximations must be adopted

• “Temporal Corner cutting”

• Degenerate Genealogical Structures

Page 13: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Many Sequences: Sequence Graphs
Istvan Miklos – Gerton Lunter – Miklos Csuros

Investigate a set of ancestral sequences/alignments that are computationally realistic

• A set of homologous sequences is given

ccgttagct ccgttagct ccgttagct ccgttagct

• With a known phylogeny

• Pairs of sequences are aligned

• Graphs defined representing alignment/ancestral sequences

• Pairs of graphs aligned….

Page 14: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

FSA - Fast Statistical Alignment
Pachter, Holmes & Co
http://math.berkeley.edu/~rbradley/papers/manual.pdf

Data – k genomes/sequences: 1, 2, 3, 4, …, k

[Figure: graph on the sequences 1, 2, 3, 4, …, k. An edge – a pairwise alignment.]

Spanning tree: 1,3  2,3  3,4  3,k
Additional edges: 1,2  2,k  1,4  4,k

Iterative addition of homology statements to shrinking alignment:

Add most certain homology statement from pairwise alignment compatible with present multiple alignment

i. Conflicting homology statements cannot be added
ii. Some scoring on multiple sequence homology statements is used.
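The iterative addition of homology statements can be illustrated with a greedy loop. The sketch below is written in the spirit of the slide and is not the FSA implementation: statements are tried in order of decreasing confidence, and one is accepted only if the resulting columns hold at most one residue per sequence and can still be ordered left to right. The data layout (residues as `(sequence, position)` pairs, statements as `(confidence, residue, residue)` triples) and all names are assumptions made for the example.

```python
from collections import defaultdict


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def is_consistent(parent, residues):
    """True if columns have <= 1 residue per sequence and admit a linear order."""
    columns = defaultdict(list)
    for r in residues:
        columns[find(parent, r)].append(r)
    for members in columns.values():
        names = [s for s, _ in members]
        if len(names) != len(set(names)):
            return False
    by_seq = defaultdict(list)
    for s, pos in residues:
        by_seq[s].append(pos)
    succ = defaultdict(set)
    indegree = {c: 0 for c in columns}
    for s, positions in by_seq.items():
        positions.sort()
        for a, b in zip(positions, positions[1:]):
            u, v = find(parent, (s, a)), find(parent, (s, b))
            if v not in succ[u]:
                succ[u].add(v)
                indegree[v] += 1
    # Kahn's algorithm: every column is reached iff the order graph is acyclic.
    queue = [c for c, d in indegree.items() if d == 0]
    visited = 0
    while queue:
        u = queue.pop()
        visited += 1
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return visited == len(columns)


def greedy_homology_merge(lengths, statements):
    """Accept homology statements greedily, most confident first."""
    residues = [(s, i) for s, n in lengths.items() for i in range(n)]
    parent = {r: r for r in residues}
    accepted = []
    for _, a, b in sorted(statements, key=lambda st: -st[0]):
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            continue  # already in the same column
        trial = dict(parent)
        trial[ra] = rb
        if is_consistent(trial, residues):
            parent = trial
            accepted.append((a, b))
    return accepted
```

Rechecking the whole column graph after every merge keeps the example short; a practical implementation would maintain the ordering incrementally.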

Page 15: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Li-Stephens

[Figure: local trees relating the sequences 1, 2, 3 and 4.]

Simplifications relative to the Ancestral Recombination Graph (ARG)

Are there intermediates between Spanning Trees and Steiner Trees?

Local Trees are Spanning Trees – not phylogenies (Steiner Trees)

No non-ancestral bridges between ancestral material

Page 16: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Spannoids – k-restricted Steiner Trees
Baudis et al. (2000) Approximating Minimum Spanning Sets in Hypergraphs and Polymatroids

[Figure: a spanning tree and a Steiner tree on the leaves 1 to 4; a 1-Spannoid and a 2-Spannoid on a larger set of leaves.]

Advantage: Decomposes large trees into small trees

Questions: How to find optimal spannoid?

How well do they approximate?

Page 17: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Example – Contraction of Simulated Coalescent Trees

Simulation

• Trees simulated from the coalescent

• Spannoid algorithm:

Conclusion

• Approximation very good for k >5

• Not very dependent on sequence number

Page 18: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Annotation & Annotation with alignment

• Annotation

• Annotation and alignment

• Three Programs

• SAPF – dynamic programming up to 4 sequences

• BigFoot – MCMC up to 13 sequences

• GRAPEfoot – pairwise genome footprinting

• Footprinting

Page 19: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

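The Basics of Evolutionary Annotation

Many aligned sequences related by a known phylogeny

[Figure: alignment of sequences 1, …, k over positions 1, …, n; an HMM along the positions switches between a slow rate r_s and a fast rate r_f.]

$P(\mathrm{Structure} \mid \mathrm{Sequence}) = \dfrac{P(\mathrm{Sequence} \mid \mathrm{Structure})\,P(\mathrm{Structure})}{P(\mathrm{Sequence})}$

[Figure: unobservable structure, illustrated with a base-paired RNA fragment.]

References shown: Knudsen.., 99; Eddy & co.; Meyer and Durbin 02; Pedersen …, 03; Siepel & Haussler 03

Footprinting - Signals (Churchill and Felsenstein, 96)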

Page 20: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Statistical Alignment and Footprinting.

[Figure: sequences 1, …, k (e.g. acgtttgaaccgag----) generated by an Alignment HMM; the Alignment HMM is combined with a Signal HMM and, more generally, with a Structure HMM, giving joint states (A,S).]

Page 21: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

“Structure” does not stem from an evolutionary model

Structure HMM on the states S and F, with transition probabilities:

F → F: 0.9
F → S: 0.1
S → S: 0.9
S → F: 0.1

• Each alignment from the Alignment HMM is annotated by the Structure HMM.

• The equilibrium annotation does not follow a Markov Chain:

F F S S F ?

• No ideal way of simulating:

using the HMM at the alignment will give other distributions on the leaves

using the HMM at the root will give other distributions on the leaves
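For a fixed alignment, annotating columns with a two-state Structure HMM reduces to the forward-backward algorithm. The sketch below assumes the 0.9/0.1 transition probabilities shown on the slide and takes per-column likelihoods under each state as given inputs; in practice these would come from a phylogenetic likelihood calculation, which is not shown. The function name, the uniform start distribution and the input format are assumptions.

```python
def posterior_state_s(column_likelihoods, stay=0.9, start=(0.5, 0.5)):
    """Posterior probability of state S at each column (forward-backward).

    column_likelihoods is a list of (L_S, L_F) pairs: the likelihood of each
    alignment column under state S and under state F. stay is the
    probability of remaining in the current state.
    """
    n = len(column_likelihoods)
    trans = [[stay, 1 - stay], [1 - stay, stay]]
    forward = []
    prev = None
    for i, like in enumerate(column_likelihoods):
        if i == 0:
            cur = [start[s] * like[s] for s in range(2)]
        else:
            cur = [like[s] * sum(prev[r] * trans[r][s] for r in range(2))
                   for s in range(2)]
        norm = sum(cur)
        prev = [c / norm for c in cur]  # rescale to avoid underflow
        forward.append(prev)
    backward = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = column_likelihoods[i + 1]
        cur = [sum(trans[s][r] * nxt[r] * backward[i + 1][r] for r in range(2))
               for s in range(2)]
        norm = sum(cur)
        backward[i] = [c / norm for c in cur]
    posterior = []
    for fw, bw in zip(forward, backward):
        weights = [fw[0] * bw[0], fw[1] * bw[1]]
        posterior.append(weights[0] / sum(weights))
    return posterior
```

Summing over alignments as well, as the talk advocates, would require running this jointly with the Alignment HMM rather than on one fixed alignment.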

Page 22: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

An example: Footprinting

Page 23: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Summing Out is Better
Satija et al., 2008

[Figure: two ROC panels, true positive rate against false positive rate.]

Panel 1: Simulated data with parameter estimated from Eve Stripe 2.

Panel 2: As above but with higher insertion-deletion rate.

DIS – summing out alignments
MPP – fixing on 1 alignment

Page 24: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Signal Factor Prediction

http://jaspar.cgb.ki.se/ http://www.gene-regulation.com/

• Given set of homologous sequences and set of transcription factors (TFs), find signals and which TFs they bind to.

• Use PWM and Bruno-Halpern (BH) method to make TF specific evolutionary models

• Drawback: BH only uses rates and equilibrium distribution

• Superior method: Infer TF Specific Position Specific evolutionary model

• Drawback: cannot be done without large scale data on TF-signal binding.

Page 25: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Knowledge Transfer and Combining Annotations

[Figure: phylogeny of mouse, pig and human, with experimental observations on 1 experimentally annotated genome (Mouse) entering as a prior.]

• Annotation Transfer

• Observed Evolution

Must be solvable by Bayesian Priors

Each position has a probability p_i of being the j’th position in the k’th TFBS

If no experiment, low probability for being in TFBS

Page 26: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

(Homologous + Non-homologous) detection

Wang and Stormo (2003) “Combining phylogenetic data with co-regulated genes to identify regulatory motifs” Bioinformatics 19.18.2369-80

Zhou and Wong (2007) Coupling Hidden Markov Models for discovery of cis-regulatory signals in multiple species Annals Statistics 1.1.36-65

[Figure labels: gene, promoter]

Unrelated genes - similar expression

Related genes - similar expression

Combine above approaches

Combine “profiles”

Page 27: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

StatAlign software package
http://phylogeny-café.elte.hu/StatAlign/statalign.tar.gz

• Written in Java 1.5

• Platform-independent graphical interface

• Jar file is available, no need to install

• Open source, extendable modules

Page 28: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Summary

The Problem
• Statistical Alignment - Annotation - Annotation & Statistical Alignment

Statistical Alignment
• The Model
• The Pairwise Algorithm – the HMM connection
• the multiple sequence alignment algorithm

Annotation

Ahead

Annotation & Alignment
• The general problem
• protein secondary structure – protein genes – RNA structure - signal
• The general algorithm
• Signals (footprinting)
• Protein Secondary Structure Prediction
• Transcription Factor Prediction - Knowledge transfer - homologous/nonhomologous analysis

Page 29: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Acknowledgements

Footprinting: Rahul Satija, Lior Pachter, Gerton Lunter

MCMC: Istvan Miklos, Jens Ledet Jensen, Alex Drummond

Program: Adam Novak, Rune Lyngsø

Spannoids: Jesper Nielsen, Christian Storm

Earlier Statistical alignment Collaborators: Mike Steel, Yun Song, Carsten Wiuf, Bjarne Knudsen, Gustav Wiebling, Christian Storm, Morten Møller

Funding BBSRC MRC Rhodes Foundation

Software http://phylogeny-café.elte.hu/StatAlign/statalign.tar.gz

Next steps http://www.stats.ox.ac.uk/research/genome/projects

Page 30: Statistical Alignment and Footprinting Rutgers – DIMACS 27.4.09

Statistical Alignment and Footprinting

Although bioinformatics is perceived as a new discipline, certain parts have a long history and could be viewed as classical bioinformatics. For example, the application of string comparison algorithms to sequence alignment has a history spanning the last three decades, beginning with the pioneering paper by Needleman and Wunsch (1970). They used dynamic programming to maximize a similarity score based on a cost of insertion-deletions and a score function on matched amino acids. The principle of choosing solutions by minimizing the amount of evolution is also called parsimony and has been widespread in phylogenetic analysis even if there is no alignment problem. This situation is likely to change significantly in the coming years.

After a pioneering paper by Bishop and Thompson (1986) that introduced and approximated likelihood calculation, Thorne, Kishino and Felsenstein (1991) proposed a well-defined, time-reversible Markov model for insertions and deletions (the TKF91 model) that allowed a proper statistical analysis for two sequences. Such an analysis can be used to provide maximum likelihood (pairwise) sequence alignments, or to estimate the evolutionary distance between two sequences. Steel et al. (2001) generalized this to any number of sequences related by a star tree. This was subsequently generalized further to any phylogeny, and more practical methods based on MCMC have been developed. We have developed this into a generally available program package.

Traditional alignment-based phylogenetic footprinting approaches make predictions on the basis of a single assumed alignment. The predictions are therefore highly sensitive to alignment errors or regions of alignment uncertainty. Alternatively, statistical alignment methods provide a framework for performing phylogenetic analyses by examining a distribution of alignments.

We developed a novel algorithm for predicting functional elements by combining statistical alignment and phylogenetic footprinting (SAPF). SAPF simultaneously performs both alignment and annotation by combining phylogenetic footprinting techniques with a hidden Markov model (HMM) transducer-based multiple alignment model, and can analyze sequence data from multiple sequences. We assessed SAPF's predictive performance on two simulated datasets and three well-annotated cis-regulatory modules from newly sequenced Drosophila genomes. The results demonstrate that removing the traditional dependence on a single alignment can significantly augment the predictive performance, especially when there is uncertainty in the alignment of functional regions.

The transducer-based version of SAPF is currently able to analyze data from up to five sequences. We are currently developing an MCMC approach that we hope will be capable of analyzing data from 12-16 species, enabling the user to input sequence data from all 12 recently sequenced Drosophila genomes. We will present initial results from the MCMC version of SAPF and discuss some of the challenges and difficulties affecting the speed of convergence.