Lecture 12: Sequence alignment algorithms 2


Transcript of Lecture 12: Sequence alignment algorithms 2

Page 1: Lecture 12: Sequence alignment algorithms 2

BINF6201/8201

Sequence alignment algorithms 2

10-12-2016

Page 2: Lecture 12: Sequence alignment algorithms 2

Global alignment vs. local alignment

Ø The Needleman-Wunsch algorithm gives the optimal alignment of two sequences over their entire lengths; the resulting alignment is therefore a global alignment.

Ø We compute the global alignment of two sequences when we believe that the domain arrangements of the two sequences are similar.

Ø However, very often we are more interested in aligning sub-regions or domains of two sequences. In such cases, we want to find the optimal local alignment between the two sequences.

[Figure: schematic comparing a global alignment of sequences a and b with a local alignment, in which only similar sub-regions are aligned and the flanking non-alignable regions are left out.]

Page 3: Lecture 12: Sequence alignment algorithms 2

Smith-Waterman local alignment algorithm

Ø The Smith-Waterman algorithm (1981) uses dynamic programming to find the optimal local alignment between two sequences.

Ø The algorithm is a modification of the Needleman-Wunsch algorithm.

Ø To identify the optimal local alignment, the algorithm terminates an ongoing alignment between the first i letters of a and the first j letters of b if the alignment is not promising, i.e., when H(i,j) is negative, and restarts a new alignment by assigning H(i,j) = 0.

Ø Therefore, H(i,j) is the score for the alignment starting from the last terminating point to the i-th and j-th positions in a and b, respectively.

[Figure: schematic of the alignment path; an ongoing alignment is terminated, and H(i,j) reset to 0, whenever H(i,j) < 0, and the cell with the maximal H(i,j) marks the end of the optimal local alignment.]

Page 4: Lecture 12: Sequence alignment algorithms 2

Smith-Waterman local alignment algorithm

Ø If we use a linear gap penalty function, the recursion relation of the Smith-Waterman algorithm is,

H(i,j) = \max \begin{cases} H(i-1, j-1) + S(a_i, b_j) & \text{diagonal} \\ H(i-1, j) - g & \text{vertical} \\ H(i, j-1) - g & \text{horizontal} \\ 0 & \text{restart} \end{cases}

with the initial conditions H(i, 0) = 0 and H(0, j) = 0.

Ø The optimal local alignment can be identified from the cell in the alignment matrix that has the maximal score.

Ø The alignment is recovered by backtracking from this cell until a zero is encountered. Of course, the record of how each H(i,j) was computed needs to be stored in a separate traceback matrix.

Ø As in the case of global alignment, this algorithm has the time complexity of O(mn) .
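The recursion above translates directly into code. The following is a minimal sketch (not part of the lecture) of the Smith-Waterman procedure with a linear gap penalty; the substitution function score and the gap penalty g are caller-supplied placeholders (in practice, score would be a lookup into a matrix such as PAM250).

def smith_waterman(a, b, score, g):
    """Return (best_score, aligned_a, aligned_b) for the optimal local alignment."""
    m, n = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    # trace[i][j]: which case produced H[i][j] ('d', 'u', 'l', or None for a restart)
    trace = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = H[i-1][j-1] + score(a[i-1], b[j-1])
            up   = H[i-1][j] - g       # a[i-1] aligned to a gap
            left = H[i][j-1] - g       # b[j-1] aligned to a gap
            H[i][j] = max(diag, up, left, 0)
            if H[i][j] == 0:
                trace[i][j] = None
            elif H[i][j] == diag:
                trace[i][j] = 'd'
            elif H[i][j] == up:
                trace[i][j] = 'u'
            else:
                trace[i][j] = 'l'
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    # Backtrack from the maximal cell until a zero (restart) is reached.
    i, j = best_cell
    out_a, out_b = [], []
    while trace[i][j] is not None:
        if trace[i][j] == 'd':
            out_a.append(a[i-1]); out_b.append(b[j-1]); i, j = i - 1, j - 1
        elif trace[i][j] == 'u':
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
    return best, ''.join(reversed(out_a)), ''.join(reversed(out_b))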

Page 5: Lecture 12: Sequence alignment algorithms 2

Smith-Waterman local alignment algorithm

[Figure: the m-by-n alignment matrix for sequences a and b; each cell H(i,j) is computed from H(i-1, j-1) + S(ai, bj), H(i-1, j) - g, and H(i, j-1) - g, or reset to 0.]

Ø Initialize the first row and column, and then compute each cell from the upper left corner to the bottom right corner.

Page 6: Lecture 12: Sequence alignment algorithms 2

Smith-Waterman local alignment algorithm

Ø Using our toy sequences, a: SHAKE and b: SPEARE, we can compute the following alignment matrix using a linear gap penalty function, W(l) = -6l.

[Figure: the Smith-Waterman alignment matrix for SHAKE vs. SPEARE, with amino acid alignment scores taken from PAM250.]

Ø Through backtracking, we obtain the following local alignment:

SHAKE
PEARE
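For illustration, the toy sequences can be run through the smith_waterman sketch from the earlier page. The +2/-1 match/mismatch scheme below is only a stand-in for the PAM250 scores used on the slide, so the resulting score and alignment may differ from the one shown above.

# Illustrative scoring only; the slide uses PAM250, which would give a different result.
def toy_score(x, y):
    return 2 if x == y else -1

best, aln_a, aln_b = smith_waterman("SHAKE", "SPEARE", toy_score, g=6)
print(best)      # best local alignment score under this toy scheme
print(aln_a)
print(aln_b)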

Page 7: Lecture 12: Sequence alignment algorithms 2

Alignment algorithms with a general gap penalty

Ø If the gap penalty function is in the general form W(l), then when filling a cell in the alignment matrix by a horizontal or vertical move, we need to determine the optimal value of l, i.e., how many spaces should have been inserted before the current one.

Ø The recursion relation for the global alignment is given by,

H(i,j) = \max \begin{cases} H(i-1, j-1) + S(a_i, b_j) & \text{diagonal} \\ \max_{1 \le l \le i} \left[ H(i-l, j) - W(l) \right] & \text{vertical} \\ \max_{1 \le l \le j} \left[ H(i, j-l) - W(l) \right] & \text{horizontal} \end{cases}

with the initial condition H(0, 0) = 0, H(i, 0) = W(i), and H(0, j) = W(j).

Ø To compute a single cell, up to i + j + 1 calculations have to be made, which has a time complexity of O(n+m), or O(n).

Ø Therefore, the entire algorithm runs in O(n^3), which is considered too slow, even though it is a polynomial-time algorithm.
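The linear number of candidate predecessors per cell is easiest to see in code. The following is a minimal sketch (not from the lecture) of the global recursion above; score and W are caller-supplied placeholders, and W is assumed here to return a positive penalty that is subtracted, so the boundary cells are initialized to -W(i) and -W(j).

def general_gap_global(a, b, score, W):
    """Global alignment score with a general gap penalty W(l); O(n^3) overall."""
    m, n = len(a), len(b)
    NEG = float("-inf")
    H = [[NEG] * (n + 1) for _ in range(m + 1)]
    H[0][0] = 0
    for i in range(1, m + 1):
        H[i][0] = -W(i)                      # leading gap of length i in b
    for j in range(1, n + 1):
        H[0][j] = -W(j)                      # leading gap of length j in a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = H[i-1][j-1] + score(a[i-1], b[j-1])   # diagonal
            for l in range(1, i + 1):                    # vertical: gap of length l in b
                best = max(best, H[i-l][j] - W(l))
            for l in range(1, j + 1):                    # horizontal: gap of length l in a
                best = max(best, H[i][j-l] - W(l))
            H[i][j] = best
    return H[m][n]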

Page 8: Lecture 12: Sequence alignment algorithms 2

Alignment algorithms with a general gap penalty

Ø The recursion relation for the local alignment version is,

H(i,j) = \max \begin{cases} H(i-1, j-1) + S(a_i, b_j) & \text{diagonal} \\ \max_{1 \le l \le i} \left[ H(i-l, j) - W(l) \right] & \text{vertical} \\ \max_{1 \le l \le j} \left[ H(i, j-l) - W(l) \right] & \text{horizontal} \\ 0 & \text{restart} \end{cases}

with the initial condition H(i, 0) = H(0, j) = 0.

Ø To avoid too many adjacent short gaps separated by very short aligned stretches, the following relation must hold,

W(l1 + l2) ≤ W(l1) + W(l2).

That is, the penalty for a long gap should not be greater than the penalty for two short ones that add up to the same length.

In general, this relation holds if the gap opening penalty is larger than the gap extension penalty and any mismatch penalty, i.e., S(a,b) > -gext > -gopen.

[Figure: a single gap of length l1+l2 (favorable alignment) versus two gaps of lengths l1 and l2 separated by a short aligned stretch (unfavorable alignment).]
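For instance, assuming the affine form W(l) = gopen + gext(l-1), which matches the initialization used on the following pages (a worked check, not from the slides), the relation can be verified directly:

W(l_1) + W(l_2) - W(l_1 + l_2)
  = \bigl[g_{open} + g_{ext}(l_1 - 1)\bigr] + \bigl[g_{open} + g_{ext}(l_2 - 1)\bigr] - \bigl[g_{open} + g_{ext}(l_1 + l_2 - 1)\bigr]
  = g_{open} - g_{ext} \;\ge\; 0 \quad \text{whenever } g_{open} \ge g_{ext},

so a single long gap is never penalized more than two shorter gaps of the same total length, as required.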

Page 9: Lecture 12: Sequence alignment algorithms 2

Alignment algorithms with affine gap penalty

Ø Because using a general form of the gap penalty function slows down the algorithm, an affine gap penalty function is preferred.

Ø When using an affine gap penalty function in dynamic programming, we only need to distinguish between the case in which a gap is first opened and the case in which a gap is being extended.

Ø Let M(i,j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which ai is aligned with bj.

Ø Let I(i,j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which ai is aligned with a space.

[Figure: M(i,j) ends with ai aligned to bj; I(i,j) ends with ai aligned to a space.]

Page 10: Lecture 12: Sequence alignment algorithms 2

Alignment algorithms with affine gap penalty

Ø Let J(i,j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which bj is aligned with a space.

[Figure: J(i,j) ends with bj aligned to a space.]

Ø Therefore, we need to fill four separate matrices. The score of the best alignment up to this point, H(i,j), is the best of the three cases,

H(i,j) = max(M(i,j), I(i,j), J(i,j)).

Ø To compute M(i,j), we consider the following three possibilities:

M(i,j) = M(i-1, j-1) + S(ai, bj)   (ai-1 was aligned with bj-1, so we extend an alignment)
M(i,j) = I(i-1, j-1) + S(ai, bj)   (ai-1 was aligned with a space, so we end a gap)
M(i,j) = J(i-1, j-1) + S(ai, bj)   (bj-1 was aligned with a space, so we end a gap)

Page 11: Lecture 12: Sequence alignment algorithms 2

Alignment algorithms with affine gap penalty

Ø Therefore, we have the following recursion to compute M(i,j),

M(i,j) = \max \begin{cases} M(i-1, j-1) + S(a_i, b_j) \\ I(i-1, j-1) + S(a_i, b_j) \\ J(i-1, j-1) + S(a_i, b_j) \end{cases}

Initialization: M(0,0) = 0, M(0,j) = -∞, and M(i,0) = -∞.

Ø To compute I(i,j), we consider the following two possibilities:

I(i,j) = M(i-1, j) - gopen   (ai-1 was aligned with bj, so we open a gap)
I(i,j) = I(i-1, j) - gext    (ai-1 was aligned with a space, so we extend the gap)

Ø Therefore, we have the following recursion to compute I(i,j),

I(i,j) = \max \begin{cases} M(i-1, j) - g_{open} \\ I(i-1, j) - g_{ext} \end{cases}

Ø We do not need to consider the case in which bj was aligned with a space and ai is now aligned with a space (an I step following a J step); by design it never happens, because opening a gap is costly.

Initialization: I(0,0) = -∞, I(0,j) = -∞, I(1,j) = -gopen, and I(i,0) = -gopen - gext(i-1).

Page 12: Lecture 12: Sequence alignment algorithms 2

Alignment algorithms with affine gap penalty

Ø Similarly, to compute J(i,j), we consider the following two possibilities:

J(i,j) = M(i, j-1) - gopen   (bj-1 was aligned with ai, so we open a gap)
J(i,j) = J(i, j-1) - gext    (bj-1 was aligned with a space, so we extend the gap)

Ø Therefore, we have the following recursion to compute J(i,j),

J(i,j) = \max \begin{cases} M(i, j-1) - g_{open} \\ J(i, j-1) - g_{ext} \end{cases}

Initialization: J(0,0) = -∞, J(i,0) = -∞, J(i,1) = -gopen, and J(0,j) = -gopen - gext(j-1).

Ø We fill the dynamic programming matrix H(i,j) using the recursive relation:

H(i,j) = \max \begin{cases} M(i,j) & \text{diagonal} \\ I(i,j) & \text{vertical} \\ J(i,j) & \text{horizontal} \end{cases}

with the initial conditions H(0,0) = 0, H(i,0) = -gopen - (i-1)gext, and H(0,j) = -gopen - (j-1)gext.
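Putting the M, I, and J recursions together, a minimal sketch (not from the lecture) of the global affine-gap score computation might look as follows; score, g_open, and g_ext are caller-supplied placeholders, and all boundary cells other than those initialized above are set to -∞.

def affine_global_score(a, b, score, g_open, g_ext):
    """Return H(m, n) = max(M, I, J) at the bottom-right cell (Gotoh-style recursion)."""
    m, n = len(a), len(b)
    NEG = float("-inf")
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    I = [[NEG] * (n + 1) for _ in range(m + 1)]
    J = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):                      # a[0..i-1] aligned to a leading gap
        I[i][0] = -g_open - g_ext * (i - 1)
    for j in range(1, n + 1):                      # b[0..j-1] aligned to a leading gap
        J[0][j] = -g_open - g_ext * (j - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = score(a[i-1], b[j-1])
            M[i][j] = max(M[i-1][j-1], I[i-1][j-1], J[i-1][j-1]) + s   # ai with bj
            I[i][j] = max(M[i-1][j] - g_open, I[i-1][j] - g_ext)       # ai with a space
            J[i][j] = max(M[i][j-1] - g_open, J[i][j-1] - g_ext)       # bj with a space
    return max(M[m][n], I[m][n], J[m][n])          # H(m, n)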

Page 13: Lecture 12: Sequence alignment algorithms 2

Alignment algorithms with affine gap penalty

Ø In this algorithm design, an I- or J-type alignment can only be followed by the same type or by an M-type alignment; thus we prevent alignments in which a space in one sequence is immediately followed by a space in the other sequence, such as,

S--HAKE
SPE-ARE

Ø To see this, note that gopen is set to a value large enough that, for any pair of residues,

S(a,b) > -gopen.

Ø Therefore, alignment (I) always scores higher than alignment (II), because S(H,E) > -gopen.

(I)  S-HAKE
     SPEARE

(II) S--HAKE
     SPE-ARE

Page 14: Lecture 12: Sequence alignment algorithms 2

Alignment algorithms with affine gap penalty

Ø As before, by adding the restart option and changing the initialization conditions of this global alignment algorithm, we obtain an algorithm for local alignment.

H(i,j) = \max \begin{cases} M(i,j) & \text{diagonal} \\ I(i,j) & \text{vertical} \\ J(i,j) & \text{horizontal} \\ 0 & \text{restart} \end{cases}

with the initial condition H(i, 0)= H(0, j)= 0.

Page 15: Lecture 12: Sequence alignment algorithms 2

Effect of scoring parameters on the alignment

Ø Although dynamic programming algorithms guarantee the optimal alignment between two sequences under a given scoring system, changing the scoring system will produce different alignments.

Ø  Shown left are alignments between human and yeast hexokinase proteins using different gap open penalties.

Page 16: Lecture 12: Sequence alignment algorithms 2

Online pairwise alignment programs

Ø The Needleman-Wunsch algorithm with an affine gap penalty function has been implemented by a few groups, and the programs are freely available as both standalone and web-based applications.

Ø Examples include the needle program in the EMBOSS package by EMBO, and the GGSEARCH program in the FASTA package.

Ø Web server of GGSEARCH:

http://www.ebi.ac.uk/Tools/fasta33/index.html?program=GGSEARCH

Page 17: Lecture 12: Sequence alignment algorithms 2

Online pairwise alignment programs

Ø The Smith-Waterman algorithm with an affine gap penalty function has also been implemented by a few groups, and the programs are freely available as both standalone and web-based applications.

Ø Examples include the water program in the EMBOSS package, and the SSEARCH program in the FASTA package.

Ø Web server of SSEARCH: http://www.ebi.ac.uk/Tools/fasta33/index.html?program=SSEARCH

Page 18: Lecture 12: Sequence alignment algorithms 2

Multiple sequence alignment

Ø Multiple sequence alignment (MSA): an alignment of three or more sequences, such that the alignment has the maximal score given a scoring matrix and a gap penalty function.

Ø Theoretically, MSA can be solved by multidimensional dynamic programming; however, this has time complexity O(N^S) for aligning S sequences of length N, so it can only be applied to a few sequences.

Ø In fact, it has been shown that MSA is an NP-hard problem; therefore, there is no known efficient algorithm to solve it.

Ø Various heuristic algorithms have been proposed to align multiple sequences. They generally perform well when the sequences to be aligned are not too distantly related to one another.

Ø Most of these heuristic algorithms, such as Clustal W/X and Clustal Omega, use a progressive alignment method to align multiple sequences.

Page 19: Lecture 12: Sequence alignment algorithms 2

Progressive alignment algorithm

Ø This method starts with the most confident pairwise alignment, and then gradually adds each sequence or group of sequences to the already aligned sequences using a guide tree.

Ø For example, the Clustal algorithms first construct a phylogenetic tree of the sequences to be aligned, using the pairwise alignments of the sequences.

Ø The evolutionary distance between two sequences can be estimated by the Kimura estimator,

d = -ln(1 - D - 0.2D^2)

Ø The evolutionary distance can also be calculated from the alignment score using the Feng and Doolittle formula,

d = -100 \left[ \ln(S - S_{rand}) - \ln(S_{ident} - S_{rand}) \right],

where S is the observed alignment score, Srand is the average score for aligning two random sequences, and Sident is the average score for aligning two identical sequences.
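The two distance estimates quoted above translate into a few lines of code. This is a minimal sketch, assuming D is the observed fraction of differing aligned positions and s, s_rand, s_ident are the alignment scores as defined on the slide.

import math

def kimura_distance(D):
    """Kimura estimator: d = -ln(1 - D - 0.2 * D**2)."""
    return -math.log(1.0 - D - 0.2 * D * D)

def feng_doolittle_distance(s, s_rand, s_ident):
    """Feng-Doolittle distance: d = -100 * [ln(s - s_rand) - ln(s_ident - s_rand)]."""
    return -100.0 * (math.log(s - s_rand) - math.log(s_ident - s_rand))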

Page 20: Lecture 12: Sequence alignment algorithms 2

Progressive alignment algorithm

Ø Clustal uses the neighbor-joining method to construct the guide tree from the computed evolutionary distance matrix.

Ø Using the tree as a guide, Clustal first aligns the two pairs of sequences that have the shortest evolutionary distances, i.e., HXK2 RAT and HXK2 HUMAN, and HXK1 RAT and HXK1 HUMAN, using the Needleman-Wunsch global alignment algorithm.

Ø These two pairwise alignments are then aligned to each other to form a cluster of four sequences.

Ø This process is repeated until all clusters are joined to form a single cluster.

Page 21: Lecture 12: Sequence alignment algorithms 2

Alignment algorithm of two clusters of sequences

Ø The algorithm for aligning two clusters of sequences is essentially the same as that for aligning two sequences, except that all the sequences in a cluster are treated as if they were a single sequence.

Ø If a space is introduced in the cluster, it will be inserted at the same position in all sequences in the cluster.

Page 22: Lecture 12: Sequence alignment algorithms 2

Alignment algorithm of two clusters of sequences

Ø To score the aligned sites i and j in two clusters, we can use the average of the individual scores for the amino acid pairs that can be formed between the clusters,

S[\text{cluster1}(i), \text{cluster2}(j)] = \frac{1}{n_1 n_2} \sum_{k=1}^{n_1} \sum_{t=1}^{n_2} S\bigl(a_k^1(i), a_t^2(j)\bigr),

where n1 and n2 are the numbers of sequences in the two clusters, and a_k^1(i) and a_t^2(j) are the amino acids at sites i and j in the k-th and t-th sequences of clusters 1 and 2, respectively.

Ø For example, for the following sites from two clusters,

Position i in cluster 1: P, A
Position j in cluster 2: I, R

the score would be

S[\text{cluster1}(i), \text{cluster2}(j)] = \frac{1}{2 \times 2} \bigl[ S(P,I) + S(P,R) + S(A,I) + S(A,R) \bigr].
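The cluster-column score above is a straightforward average over all cross-cluster pairs. The following is a minimal sketch (not from the lecture); score is a caller-supplied substitution function, e.g. a PAM250 lookup.

def cluster_column_score(col1, col2, score):
    """Average substitution score over all pairs formed between the two aligned columns.

    col1, col2: lists of residues at position i in cluster 1 and position j in cluster 2.
    """
    total = sum(score(x, y) for x in col1 for y in col2)
    return total / (len(col1) * len(col2))

# Example from the slide: cluster 1 has P and A at position i, cluster 2 has I and R at position j.
# cluster_column_score(['P', 'A'], ['I', 'R'], score)
#   == (S(P,I) + S(P,R) + S(A,I) + S(A,R)) / 4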

Page 23: Lecture 12: Sequence alignment algorithms 2

Problems of progressive alignment algorithms

Ø Because of the heuristic nature of the progressive alignment algorithm, the globally optimal alignment is not guaranteed.

Ø In particular, the errors made by an earlier step cannot be corrected by the later steps.

Ø To avoid the bias caused by aligning very closely related sequences in the earlier steps, Clustal uses a weighted scoring system, i.e., smaller weights are given to closely related sequences when computing the alignment scores.

Ø Sometimes, manual adjustment is needed in the regions that are not aligned very well.

With the weights, the cluster-to-cluster score becomes,

S[\text{cluster1}(i), \text{cluster2}(j)] = \frac{w_{1,2}}{n_1 n_2} \sum_{k=1}^{n_1} \sum_{t=1}^{n_2} S\bigl(a_k^1(i), a_t^2(j)\bigr).

Page 24: Lecture 12: Sequence alignment algorithms 2

Online multiple sequence alignment programs

Ø Popular multiple sequence alignment programs include:

1.  Clustal X/W/OMEGA: http://www.clustal.org/

2.  T-Coffee: http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html

3.  MUSCLE: http://www.drive5.com/muscle/

Ø Since MSA algorithms are still an active research area, new algorithms and programs are expected in the future.

Ø A recent development is the SATe algorithm, which iteratively constructs the guide tree and the alignment until a convergence criterion is met: http://people.ku.edu/~jyu/sate/sate.html