Post on 21-Dec-2015
Approaches to Sequence Analysis
[Figure: tree with leaves s1, s2, s3, s4, labelled "statistics", shown with the alignment]
GT-CAT
GTTGGT
GT-CA-
CT-CA-
Parsimony, similarity, optimisation.
Data {GTCAT,GTTGGT,GTCA,CTCA}
Actual Practice: 2 phase analysis.
Ideal Practice: 1 phase analysis.
1. TKF91 - The combined substitution/indel process.
2. Acceleration of Basic Algorithm
3. Many Sequence Algorithm
4. MCMC Approaches
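Thorne-Kishino-Felsenstein (1991) Process
λ (birth rate), μ (death rate)
[Figure: a sequence of links, e.g. A # C G, evolving from T = 0 to T = t, with links (#) being born and dying]
1. $P(s) = (1 - \lambda/\mu)(\lambda/\mu)^{l}\, \pi_A^{\#A} \cdots \pi_T^{\#T}$, where $l$ = length(s) and $\#A$ is the number of A's in s
2. Time reversible [Figure: s1 and s2 related directly, and through a common root r]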
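The equilibrium distribution in point 1 is a geometric length distribution multiplied by residue frequencies. A minimal Python sketch, assuming a DNA alphabet and uniform frequencies by default (the function names and defaults are illustrative, not taken from the slides):

```python
import math

def tkf91_equilibrium_logprob(seq, lam, mu, pi=None):
    """log P(s) = log(1 - lam/mu) + l*log(lam/mu) + sum of log pi over residues."""
    if not 0 < lam < mu:
        raise ValueError("need 0 < lam < mu for a proper length distribution")
    if pi is None:
        pi = {base: 0.25 for base in "ACGT"}  # assumption: uniform frequencies
    rho = lam / mu
    logp = math.log(1.0 - rho) + len(seq) * math.log(rho)
    return logp + sum(math.log(pi[x]) for x in seq)

def expected_length(lam, mu):
    """Mean of the geometric length distribution: (lam/mu) / (1 - lam/mu)."""
    return lam / (mu - lam)

if __name__ == "__main__":
    # Sequences from the opening slide; the rates are arbitrary illustrative values.
    for s in ["GTCAT", "GTTGGT", "GTCA", "CTCA"]:
        print(s, tkf91_equilibrium_logprob(s, lam=0.03, mu=0.04))
```

Working in log space avoids underflow for long sequences, since P(s) shrinks geometrically with length.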
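λ & μ into Alignment Blocks
A. Amino Acids Ignored:
# - - -
# # # #   (k)   $p_k(t) = e^{-\mu t}[1-\lambda\beta](\lambda\beta)^{k-1}$
# - - - -
- # # # #   (k)   $p'_k(t) = [1-e^{-\mu t}-\mu\beta][1-\lambda\beta](\lambda\beta)^{k-1}$, $p'_0(t) = \mu\beta(t)$
* - - - -
* # # # #   (k)   $p''_k(t) = [1-\lambda\beta](\lambda\beta)^{k}$
where $\beta = [1-e^{(\lambda-\mu)t}]/[\mu-\lambda e^{(\lambda-\mu)t}]$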
B. Amino Acids Considered:
T - - -
R Q S W   (k = 4)   $P_t(T \to R)\,\pi_Q \cdots \pi_W\, p_4(t)$
T - - - -
- R Q S W   (k = 4)   $\pi_R\,\pi_Q \cdots \pi_W\, p'_4(t)$
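The block probabilities for the three link fates can be sketched in Python as follows. This uses the standard TKF91 closed forms; the function names are illustrative only, and the indexing follows the slide (k counts the # symbols in the lower row, excluding the immortal link *).

```python
import math

def tkf91_beta(lam, mu, t):
    """beta(t) = (1 - e^{(lam-mu)t}) / (mu - lam*e^{(lam-mu)t}), for lam < mu."""
    e = math.exp((lam - mu) * t)
    return (1.0 - e) / (mu - lam * e)

def p_survive(k, lam, mu, t):
    """p_k(t): the link survives and the block has k links in total (k >= 1)."""
    lb = lam * tkf91_beta(lam, mu, t)
    return math.exp(-mu * t) * (1.0 - lb) * lb ** (k - 1)

def p_die(k, lam, mu, t):
    """p'_k(t): the link dies and leaves k descendants (k >= 0)."""
    b = tkf91_beta(lam, mu, t)
    if k == 0:
        return mu * b
    lb = lam * b
    return (1.0 - math.exp(-mu * t) - mu * b) * (1.0 - lb) * lb ** (k - 1)

def p_immortal(k, lam, mu, t):
    """p''_k(t): the immortal link has k mortal descendants (k >= 0)."""
    lb = lam * tkf91_beta(lam, mu, t)
    return (1.0 - lb) * lb ** k
```

A quick sanity check is that the fates of a mortal link sum to one (all p_k for k >= 1 plus all p'_k for k >= 0), and so do the p''_k for the immortal link.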
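Differential Equations for p-functions
# - - ... -
# # # ... #   $\Delta p_k = \Delta t\,[\lambda(k-1)\,p_{k-1} + \mu k\,p_{k+1} - (\lambda+\mu)k\,p_k]$
# - - - ... -
- # # # ... #   $\Delta p'_k = \Delta t\,[\lambda(k-1)\,p'_{k-1} + \mu(k+1)\,p'_{k+1} - (\lambda+\mu)k\,p'_k + \mu\,p_{k+1}]$
* - - - ... -
* # # # ... #   $\Delta p''_k = \Delta t\,[\lambda k\,p''_{k-1} + \mu(k+1)\,p''_{k+1} - [(k+1)\lambda + k\mu]\,p''_k]$
Initial Conditions: $p_k(0) = p''_k(0) = p'_k(0) = 0$ for $k > 1$; $p_1(0) = p''_0(0) = 1$; $p'_0(0) = 0$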
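As a numerical check, the first equation can be integrated with explicit Euler steps from the initial condition p_1(0) = 1 and compared with the closed form $p_k(t) = e^{-\mu t}[1-\lambda\beta](\lambda\beta)^{k-1}$. This is a minimal sketch: the truncation at `kmax`, the step count and the function names are assumptions made for illustration.

```python
import math

def closed_form_p(k, lam, mu, t):
    """Closed-form p_k(t) for a surviving link with k links in its block."""
    e = math.exp((lam - mu) * t)
    lb = lam * (1.0 - e) / (mu - lam * e)
    return math.exp(-mu * t) * (1.0 - lb) * lb ** (k - 1)

def euler_p(lam, mu, t, kmax=40, steps=5000):
    """Integrate the p_k equations for k = 1..kmax with explicit Euler steps."""
    dt = t / steps
    p = [0.0] * (kmax + 2)  # p[0] is unused; p[kmax + 1] stays 0 (truncation)
    p[1] = 1.0              # initial condition p_1(0) = 1
    for _ in range(steps):
        new = p[:]
        for k in range(1, kmax + 1):
            new[k] = p[k] + dt * (lam * (k - 1) * p[k - 1]
                                  + mu * k * p[k + 1]
                                  - (lam + mu) * k * p[k])
        p = new
    return p

if __name__ == "__main__":
    numeric = euler_p(1.0, 2.0, 0.5)
    for k in range(1, 6):
        print(k, numeric[k], closed_form_p(k, 1.0, 2.0, 0.5))
```

The two columns should agree to a few decimal places; a smaller step size tightens the agreement.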
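Basic Pairwise Recursion (O(length^3))
[Figure: the last link s1[i] either survives or dies, and its alignment block absorbs the last k residues of s2[1:j]]
Survives: k = 1…j (j cases)
Dies: k = 0…j (j+1 cases)
Example (dies, one descendant): $P(s1_{i-1} \to s2_{j-1})\, p'_1\, \pi(s2[j])$
Example (survives, two descendants), a term of $P(s1_i \to s2_j)$: $P(s1_{i-1} \to s2_{j-2})\, p_2\, f(s1[i], s2[j-1])\, \pi(s2[j])$
$p_k(t) = e^{-\mu t}[1-\lambda\beta](\lambda\beta)^{k-1}$, where $\beta = [1-e^{(\lambda-\mu)t}]/[\mu-\lambda e^{(\lambda-\mu)t}]$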
Basic Pairwise Recursion (O(length^3))
[Figure: dynamic-programming grid with axes i and j. Cell (i,j) is computed from cells (i-1,j), (i-1,j-1), ..., (i-1,j-k), with "survive" and "death" contributions.]
Initial condition: $p''_j$ for s2[1:j]
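The cubic recursion can be sketched directly from the block decomposition: cell (i,j) sums, over the size k of the last block, the value of cell (i-1,j-k) times the probability of that block. The sketch below is one reading of the slides, not the authors' code. It assumes a DNA alphabet, uniform frequencies and a Jukes-Cantor-style substitution function f; `sub_rate` and all names are hypothetical.

```python
import math

def tkf91_joint_prob(s1, s2, lam, mu, t, sub_rate=1.0):
    """P(s1) * P(s1 -> s2) under TKF91, summed over all alignments."""
    pi = 0.25                           # assumption: uniform base frequencies
    decay = math.exp(-sub_rate * t)

    def f(a, b):                        # assumed Jukes-Cantor-style substitution
        return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

    e = math.exp((lam - mu) * t)
    beta = (1.0 - e) / (mu - lam * e)
    lb, mb, surv = lam * beta, mu * beta, math.exp(-mu * t)

    def p(k):                           # link survives, k >= 1 links in block
        return surv * (1.0 - lb) * lb ** (k - 1)

    def pd(k):                          # link dies, k >= 0 descendants
        return mb if k == 0 else (1.0 - surv - mb) * (1.0 - lb) * lb ** (k - 1)

    n, m = len(s1), len(s2)
    # L[i][j] = P(s1[:i] -> s2[:j]); row 0 is the immortal-link block p''_j.
    L = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        L[0][j] = (1.0 - lb) * lb ** j * pi ** j
    for i in range(1, n + 1):
        for j in range(m + 1):
            total = L[i - 1][j] * pd(0)              # s1[i] dies, no descendants
            for k in range(1, j + 1):
                first = s2[j - k]                    # first residue of last block
                block = (p(k) * f(s1[i - 1], first) + pd(k) * pi) * pi ** (k - 1)
                total += L[i - 1][j - k] * block
            L[i][j] = total
    rho = lam / mu
    return (1.0 - rho) * rho ** n * pi ** n * L[n][m]

if __name__ == "__main__":
    a = tkf91_joint_prob("GTCAT", "GTTGGT", 0.03, 0.04, 1.0)
    b = tkf91_joint_prob("GTTGGT", "GTCAT", 0.03, 0.04, 1.0)
    print(a, b)
```

Because the process is time reversible and the assumed substitution model is symmetric, swapping the two sequences should give the same joint probability, which is a convenient check on an implementation.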
Acceleration of Pairwise Algorithm (From Hein, Wiuf, Knudsen, Moeller & Wiebling 2000)
Corner Cutting ~100-1000
Better Numerical Search ~10-100 (Ex.: good start guess, 28 evaluations, 3 iterations)
Simpler Recursion ~3-10
Faster Computers ~250
1991-->2000 ~10^6
α-globin (141) and β-globin (146) (From Hein, Wiuf, Knudsen, Moeller & Wiebling 2000)
430.108 : -log(α-globin)
327.320 : -log(α-globin --> β-globin)
747.428 : -log(α-globin, β-globin) = -log(l(sumalign))
λ*t: 0.0371805 +/- 0.0135899
μ*t: 0.0374396 +/- 0.0136846
s*t: 0.91701 +/- 0.119556
E(Length): 143.499
E(Insertions,Deletions): 5.37255
E(Substitutions): 131.59
Maximum contributing alignment:
α: V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT
β: VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS
α: NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
β: DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Ratio l(maxalign)/l(sumalign) = 0.00565064
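The three expectations quoted above can be reproduced from the fitted parameters. The formulas below are an interpretation that happens to match the quoted numbers, not something stated on the slide: mean equilibrium length λ/(μ-λ), one insertion opportunity per link including the immortal link, and s*t substitutions per residue.

```python
# Parameter estimates quoted on the slide (point estimates only).
lam_t, mu_t, s_t = 0.0371805, 0.0374396, 0.91701

# Mean of the geometric equilibrium length distribution.
expected_length = lam_t / (mu_t - lam_t)

# Assumed reading: lam*t events per link, counting the immortal link as well.
expected_indels = lam_t * (expected_length + 1)

# Assumed reading: s*t substitutions per residue of an average-length sequence.
expected_subs = s_t * expected_length

print(expected_length, expected_indels, expected_subs)
```

The printed values should be close to 143.499, 5.37255 and 131.59.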
The invasion of the immortal link
VLSPADNAL.....DLHAHKR (141 AA long)
???????????????????? (k AA long)
[Figure: time axis marked 2*10^7 years, 2*10^8 years, 2*10^9 years]
*########### …. ### (141 AA long)
*########### …. ###
*########### …. ###
10^9 years
Algorithm for alignment on star tree (O(length^6)) (Steel & Hein, 2001)
* ()*######
P(S) (1)[P*(S)
P# (Tail )P(S Tail)]
[Figure: star tree with internal node a and leaves s1, s2, s3; example leaf sequences *ACGC, *TT GT, *ACG GT]
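Binary Tree Problem
[Figure: binary tree with leaves s1, s2, s3, s4 and internal nodes a1, a2; leaf sequences ACCT, GTT, TGA, ACG]
The problem would be simpler if:
i. The ancestral sequences and their alignment were known.
ii. The alignment of ancestral alignment columns to leaf sequences was known.
How to sum over all possible ancestral sequences and their alignments?
A Markov chain generating ancestral alignments can solve the problem!
[Figure: example alignment of the ancestral sequences a1 and a2, built from columns of * (immortal link), # (link) and - (gap)]
[Table: transition probabilities of the Markov chain between ancestral alignment columns (# #, - #, # -) and the end state E]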
Generating Ancestral Alignments
[Figure: Markov chain generating the columns of the alignment of a1 and a2 (* *, - #, # #), ending in state E]
The Basic Recursion
[Figure: chain of alignment columns from start state S to end state E]
"Remove 1st step" recursion:
"Remove last step" recursion:
Last-step and first-step removal are not equivalent, but they have the same complexity.
The first-step algorithm is the simplest.
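Sequence Recursion: First Step Removal
P(S_k): probability of the epifixes (S[k+1:l]), given the state in which the Markov chain (MC) starts.
[Equation: P(S_k) is a sum over i and over H in C of P'(kS_i, H) times P(S_i). P'(kS_i, H) is a product of p' terms over the j with H(j) = 0 and p terms over the j with H(j) = 1, each evaluated at t_j on a segment of s_j running from i(j) to k(j), times F(kS_i, H).]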
Human alpha hemoglobin; Human beta hemoglobin; Human myoglobin; Bean leghemoglobin
Probability of data: e^-1560.138
Probability of data and alignment: e^-1593.223
Probability of alignment given data: 4.279 * 10^-15 = e^-33.085
Ratio of insertion-deletions to substitutions: 0.0334
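The probability of the alignment given the data follows from the two log-probabilities above by subtraction. A short check of the arithmetic (variable names are illustrative):

```python
import math

log_p_data = -1560.138                # log probability of the data
log_p_data_and_alignment = -1593.223  # log probability of data and alignment

# P(alignment | data) = P(data, alignment) / P(data), computed in log space.
log_posterior = log_p_data_and_alignment - log_p_data
posterior = math.exp(log_posterior)

print(log_posterior, posterior)
```

Subtracting in log space is necessary here because e^-1560 underflows to zero in double precision.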
Maximum likelihood phylogeny and alignment
Metropolis-Hastings Statistical Alignment (Lunter, Drummond, Miklos, Jensen & Hein, 2005)
Gerton Lunter
Istvan Miklos
Alexei Drummond
Yun Song