Chapter 2

Pairwise Alignment

Pairwise Alignment

• Ask if two sequences are related

• First align the sequences (or parts of them) and then decide whether that alignment is more likely to have occurred because the sequences are related, or just by chance

Sequence Alignment

• Definition: Procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences

– Pairwise alignment: compare two sequences

– Multiple sequence alignment: compare more than two sequences

Example sequence alignment

• Task: align “abcdef” with “abdgf”

• Write second sequence below the first

abcdef
abdgf

• Move sequences to give maximum match between them

• Show characters that match using the identical letter

abcdef
ab
abdgf

• Insert gap between b and d on lower sequence to allow d and f to align

abcdef
ab d f
ab-dgf

• Note e and g don’t match

Matching Similarity vs. Identity

• Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters

• More on how to define similarity later

Global vs. Local Alignment

• We distinguish

– Global alignment algorithms, which optimize the overall alignment between two sequences

– Local alignment algorithms, which seek only relatively conserved pieces of sequence

• Alignment stops at the ends of regions of strong similarity

• Favors finding conserved patterns in otherwise different pairs of sequences

Global vs. Local Alignment

• Global

LGPSSKQTGKGS-SRIWDN
L    k  GKG   R  D
LN-ITKSAGKGAIMRLGDA

• Local

--------GKG--------
        GKG
--------GKG--------

Why do sequence alignments?

• To find whether two (or more) genes or proteins are evolutionarily related to each other

• To find structurally or functionally similar regions within proteins

Key Issues

• What sorts of alignment should be considered

• The scoring system used to rank alignments

• The algorithm used to find optimal (or good) scoring alignments

• The statistical methods used to evaluate the significance of an alignment score

Example

• The following figure shows an example of three pairwise alignments, all to the same region of the human alpha globin protein sequence (SWISS-PROT database identifier HBA_HUMAN).

• Identical positions are indicated with letters, and ‘similar’ positions with a plus (+) sign

Example

• In the first alignment, there are many identical positions (“matches”); many of the other positions are functionally conservative substitutions (e.g. D-E towards the end)

• The second alignment is also biologically meaningful (the proteins are evolutionarily related, have the same 3D structure, and have the same function in oxygen binding), but it has many fewer identities

• The third alignment has a similar number of identities or conservative changes, yet it is a spurious alignment to a protein that has a completely different structure and function

Challenges

• How to distinguish the second one from the third one?

• The determination of the scoring system is crucial

• It is difficult to distinguish true alignments from spurious alignments

The Scoring Model

• When comparing sequences, we look for evidence that they have diverged from a common ancestor by a process of mutation and selection

• Basic mutational processes

– Substitutions: change residues in a sequence

– Insertions and deletions: add or remove residues

• Insertions and deletions are referred to as “gaps”

• The total score assigned to an alignment will be a sum of terms for each aligned pair of residues, plus terms for each gap

The Scoring Model

• We expect identities and conservative substitutions to be more likely in alignments than we expect by chance, and so to contribute positive score terms

• Non-conservative changes are expected to be observed less frequently in real alignments than we expect by chance, and so these contribute negative score terms

Assumption

• We can consider mutations at different sites in a sequence to have occurred independently

• This appears to be a reasonable approximation for DNA and protein sequences

• It is only an approximation, however: the interactions between residues also play a very critical role

• Long range dependencies should be considered for structural RNAs

Substitution Matrices

• Consider a pair of sequences, x and y, of lengths n and m

• Let xi be the ith symbol in x and yj be the jth symbol in y

• These symbols come from some alphabet A; in the case of DNA this will be the four bases {A, G, C, T}, and in the case of proteins the twenty amino acids

• We will only consider ungapped global pairwise alignments, i.e., two completely aligned equal-length sequences

Rationale

• Given a pair of aligned sequences, we want to assign a score to the alignment that gives a measure of the relative likelihood that the sequences are related as opposed to being unrelated

• Assign a probability to the alignment in each of the two cases

• We consider the ratio of the two probabilities

Unrelated or Random Model

• Let R be the unrelated model

• The letter a occurs independently with some frequency qa, and hence the probability of the two sequences is just the product of the probabilities of each amino acid:

$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$

Alternative Match Model

• Let M be the alternative match model

• Aligned pairs of residues occur with a joint probability pab

• A probability for the whole alignment is

$$P(x, y \mid M) = \prod_i p_{x_i y_i}$$

The Odds Ratio

The ratio of these two likelihoods is known as the odds ratio:

$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$

The Log Odds Ratio

We take the logarithm of the odds ratio:

$$S = \sum_i s(x_i, y_i), \quad \text{where} \quad s(a, b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$

is the log likelihood ratio of the residue pair (a, b) occurring as an aligned pair, as opposed to an unaligned pair
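
As a small numeric illustration of the log-odds score, the sketch below uses a made-up two-letter alphabet. The frequencies qa and joint probabilities pab are assumptions chosen only so the numbers are easy to check; they do not come from any real substitution matrix. It shows that summing s(a, b) over the aligned pairs gives the logarithm of the odds ratio.

```python
import math

# Hypothetical toy model (not real data): background frequencies q_a and
# joint probabilities p_ab for a two-letter alphabet. The p_ab sum to 1.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4,
     ("A", "B"): 0.1, ("B", "A"): 0.1}


def s(a, b):
    """Log-odds score s(a, b) = log(p_ab / (q_a * q_b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def alignment_score(x, y):
    """Score S of an ungapped alignment of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("an ungapped global alignment needs equal lengths")
    return sum(s(a, b) for a, b in zip(x, y))


def odds_ratio(x, y):
    """P(x, y | M) / P(x, y | R), computed directly as a product."""
    ratio = 1.0
    for a, b in zip(x, y):
        ratio *= p[(a, b)] / (q[a] * q[b])
    return ratio


# Identities score positive and the mismatch scores negative in this toy model.
print(s("A", "A"), s("A", "B"))
# The additive score equals the log of the multiplicative odds ratio.
print(alignment_score("AAB", "ABB"), math.log(odds_ratio("AAB", "ABB")))
```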

Substitution Matrices

• The s(a,b) scores can be arranged in a matrix

• For proteins, they form a 20×20 matrix (score matrix or substitution matrix)

• Using the BLOSUM50 matrix, the first alignment gets a score of 130

• PAM matrices are another family of substitution matrices

• Any substitution matrix is making a statement about the probability of observing ab pairs in real alignments

A R N D C Q E G H I L K M F P S T W Y V

A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -2 -1 -1 -3 -1 1 0 -3 -2 0

R -2 7 -1 -2 -4 1 0 -3 0 -4 -3 3 -2 -3 -3 -1 -1 -3 -1 -3

N -1 -1 7 2 -2 0 0 0 1 -3 -4 0 -2 -4 -2 1 0 -4 -2 -3

D -2 -2 2 8 -4 0 2 -1 -1 -4 -4 -1 -4 -5 -1 0 -1 -5 -3 -4

C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1

Q -1 1 0 0 -3 7 2 -2 1 -3 -2 2 0 -4 -1 0 -1 -1 -1 -3

E -1 0 0 2 -3 2 6 -3 0 -4 -3 1 -2 -3 -1 -1 -1 -3 -2 -3

G 0 -3 0 -1 -3 -2 -3 8 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4

H -2 0 1 -1 -3 1 0 -2 10 -4 -3 0 -1 -1 -2 -1 -2 -3 2 -4

I -1 -4 -3 -4 -2 -3 -4 -4 -4 5 2 -3 2 0 -3 -3 -1 -3 -1 4

L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5 -3 3 1 -4 -3 -1 -2 -1 1

K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6 -2 -4 -1 0 -1 -3 -2 -3

M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7 0 -3 -2 -1 -1 0 1

F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 -4 -3 -2 1 4 -1

P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3

S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 2 -4 -2 -2

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5 -3 -2 0

W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 2 -3

Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8 -1

V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5

BLOSUM50

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  Ala
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  Cys
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3  Asp
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2  Glu
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  Phe
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  Gly
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2  His
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  Ile
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2  Lys
L  -1  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  Leu
M  -1  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  Met
N  -2  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -2  Asn
P  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -3  Pro
Q  -1  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1  Gln
R  -1  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -2  Arg
S   1  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -2  Ser
T   0  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -2  Thr
V   0  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  Val
W  -3  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11   2  Trp
Y  -2  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2   7  Tyr
  Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr

Henikoff, S. and Henikoff, J.G. (1992) Proc. Nat. Acad. Sci. USA 89, 10915-10919.

BLOSUM62

Gap Penalties

The standard cost associated with a gap of length g is given either by a linear score

$$\gamma(g) = -gd$$

or an affine score

$$\gamma(g) = -d - (g-1)e$$

where d is called the gap-open penalty and e is called the gap-extension penalty.

Gap Penalties

• The gap-extension penalty e is usually set to something less than the gap-open penalty d, allowing long insertions and deletions to be penalized less than they would be by the linear gap cost

• This is desirable when gaps of a few residues are expected almost as frequently as gaps of a single residue
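
The difference between the two gap costs is easy to see numerically. The sketch below is only an illustration: the function names are made up here, d = 8 is the linear penalty used in the alignment example later on, and d = 12, e = 2 are the affine values that appear in Exercise 2.10.

```python
def linear_gap(g, d):
    """Linear gap score gamma(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score gamma(g) = -d - (g-1)*e for a gap of length g >= 1."""
    return -d - (g - 1) * e


# Compare the two costs for gaps of increasing length.
for g in (1, 2, 5, 10):
    print(g, linear_gap(g, d=8), affine_gap(g, d=12, e=2))
```

With these values a single-residue gap is more expensive under the affine score (-12 versus -8), but a gap of length 10 is much cheaper (-30 versus -80), which is the intended effect of setting e below d.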

Gap Probability

The probability of a gap occurring at a particular site in a given sequence is

$$P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$$

where f(g) is a function of the gap length g, and the qa probabilities are the same as those used in the random model. When we divide by the probability of this region according to the random model to form the odds ratio, the qxi terms cancel out, so we are left only with a term dependent on length, γ(g)=log(f(g)); i.e., gap penalties correspond to the log probability of a gap of that length.

Alignment Algorithms

• Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences

• When both sequences have the same length n and no gaps are allowed, there is only one possible global alignment of the complete sequences

• When gaps are allowed, there are

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \simeq \frac{2^{2n}}{\sqrt{\pi n}}$$

possible global alignments between two sequences of length n

Example

(1) ab    (2) ab-    (3) ab-    (4) -ab
    cd        -cd        c-d        cd-

(5) ab--    (6) -ab-
    --cd        c--d
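
To get a feel for how quickly the number of gapped alignments grows, the count can be evaluated directly. This is a minimal sketch; the function names are illustrative. For n = 2 it gives 6, matching the six alignments listed above.

```python
import math


def count_alignments(n):
    """Exact value of the binomial coefficient (2n choose n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Approximation 2^(2n) / sqrt(pi * n) to the same quantity."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (2, 10, 100):
    print(n, count_alignments(n), f"{stirling_estimate(n):.3g}")
```

Already for n = 100 the count has about 59 digits, which is why enumerating all alignments is not an option.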

Dynamic Programming

• Guaranteed to find the optimal scoring alignment or set of alignments

• Central to computational sequence analysis

• Maximize the score to find the optimal alignment

Example

• We wish to align two short amino acid sequences: HEAGAWGHEE & PAWHEAE

• We use the BLOSUM50 score matrix, and a gap cost per unaligned residue of d=8 (each unaligned residue contributes -8 to the score)

H E A G A W G H E E

P -2 -1 -1 -2 -1 -4 -2 -2 -1 -1

A -2 -1 5 0 5 -3 0 -2 -1 -1

W -3 -3 -3 -3 -3 15 -3 -3 -3 -3

H 10 0 -2 -2 -2 -3 -2 10 0 0

E 0 6 -1 -3 -1 -3 -3 0 6 6

A -2 -1 5 0 5 -3 0 -2 -1 -1

E 0 6 -1 -3 -1 -3 -3 0 6 6

Global Alignment: Needleman-Wunsch Algorithm

• Construct a matrix F indexed by i, j, one index from each sequence

• F(i, j) is the score of the best alignment between the initial segment x1…i of x up to xi and the initial segment y1…j of y up to yj

• We begin with F(0,0)=0, and then fill the matrix from top left to bottom right

• If F(i-1, j-1), F(i-1, j), and F(i, j-1) are known, it is possible to calculate F(i, j)

Three Ways of Alignments

• xi is aligned to yj

I  G  A  xi
L  G  V  yj

• xi is aligned to a gap

A  I  G  A  xi
G  V  yj -  -

• yj is aligned to a gap

G  A  xi -  -
S  L  G  V  yj


• xi is aligned to yj, F(i, j)= F(i-1, j-1)+s(xi, yj)

• xi is aligned to a gap, F(i, j)= F(i-1, j)-d

• yj is aligned to a gap, F(i, j)= F(i, j-1)-d


$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j), \\ F(i-1,j) - d, \\ F(i,j-1) - d. \end{cases}$$

The Diagram

F(i-1,j-1)            F(i,j-1)
        ↘ s(xi,yj)        ↓ -d
F(i-1,j)    → -d →    F(i,j)

The F Matrix

• As we fill in the F(i, j), we also keep a pointer in each cell back to the cell from which its F(i, j) was derived

• Along the top row, where j=0, the values F(i, j-1) and F(i-1, j-1) are not defined, so the values F(i, 0) must be handled specially

• The values F(i, 0) represent alignments of a prefix of x to all gaps in y, so we can define F(i, 0) =-id. Likewise, F(0, j)=-jd

• F(n, m) is the best score for an alignment of x1…n to y1…m
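
The recurrence, boundary conditions and traceback pointers can be put together in a few lines. The following is a minimal sketch, not a reference implementation: it hard-codes the BLOSUM50 values for the six residues that occur in the HEAGAWGHEE / PAWHEAE example, uses the linear gap penalty d = 8, and the names `B50`, `s` and `needleman_wunsch` are chosen here for illustration. When several alignments are co-optimal it returns just one of them.

```python
# BLOSUM50 scores for the six residues in the example (from the table above).
RES = "AEGHPW"
B50 = {
    "A": (5, -1, 0, -2, -1, -3),
    "E": (-1, 6, -3, 0, -1, -3),
    "G": (0, -3, 8, -2, -2, -3),
    "H": (-2, 0, -2, 10, -2, -3),
    "P": (-1, -1, -2, -2, 10, -4),
    "W": (-3, -3, -3, -3, -4, 15),
}


def s(a, b):
    """Substitution score s(a, b)."""
    return B50[a][RES.index(b)]


def needleman_wunsch(x, y, d=8):
    """Global alignment with linear gap penalty d; returns (score, x_row, y_row)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # F(i, 0) = -i*d: prefix of x against gaps
        F[i][0], ptr[i][0] = -i * d, "X"
    for j in range(1, m + 1):            # F(0, j) = -j*d: prefix of y against gaps
        F[0][j], ptr[0][j] = -j * d, "Y"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each candidate carries the pointer label of the cell it came from.
            F[i][j], ptr[i][j] = max(
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "M"),  # xi aligned to yj
                (F[i - 1][j] - d, "X"),                          # xi aligned to a gap
                (F[i][j - 1] - d, "Y"),                          # yj aligned to a gap
            )
    # Traceback from (n, m) to (0, 0), following the stored pointers.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "M":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "X":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)  # 1
print(ax)
print(ay)
```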

The Global Dynamic Programming Matrix

Local Alignment: Smith-Waterman Algorithm

• In the previous section, we assumed that we know which sequences we want to align, and that we are looking for the best match between them from one end to the other

• More often, we are looking for the best alignment between subsequences of x and y

• This arises when it is suspected that two protein sequences may share a common domain, or when comparing extended sections of genomic DNA sequence

Local Alignment

• It is usually the most sensitive way to detect similarity when comparing two very highly diverged sequences, even when they share an evolutionary origin

• In this case, only part of the sequence has been under strong enough selection to preserve detectable similarity; the rest will have accumulated so much noise through mutation that it is no longer alignable

• The highest scoring alignment of subsequences of x and y is called the best local alignment

The Algorithm

• The algorithm is closely related to that for global alignments; however, there are two differences

• In each cell in the table, F(i, j) is allowed to take the value 0 if all other options have values less than 0

• An alignment can end anywhere in the matrix

The Algorithm

$$F(i,j) = \max \begin{cases} 0, \\ F(i-1,j-1) + s(x_i,y_j), \\ F(i-1,j) - d, \\ F(i,j-1) - d. \end{cases}$$

The First Difference

• Taking the option 0 corresponds to starting a new alignment

• If the best alignment up to some point has a negative score, it is better to start a new one, rather than extend the old one

• The top row and left column will be filled with 0s, not –id and –jd as for global alignment

The Second Difference

• We look for the highest value of F(i, j) over the whole matrix, and start the traceback from there

• The traceback ends when we meet a cell with value 0, which corresponds to the start of the alignment
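
Both differences amount to small changes to the global algorithm. The sketch below is an illustration under the same assumptions as the HEAGAWGHEE / PAWHEAE example (BLOSUM50 values for the six residues involved, linear gap penalty d = 8); the names `B50`, `s` and `smith_waterman` are chosen here and are not from the slides.

```python
# BLOSUM50 scores for the six residues in the example (from the table above).
RES = "AEGHPW"
B50 = {
    "A": (5, -1, 0, -2, -1, -3),
    "E": (-1, 6, -3, 0, -1, -3),
    "G": (0, -3, 8, -2, -2, -3),
    "H": (-2, 0, -2, 10, -2, -3),
    "P": (-1, -1, -2, -2, 10, -4),
    "W": (-3, -3, -3, -3, -4, 15),
}


def s(a, b):
    """Substitution score s(a, b)."""
    return B50[a][RES.index(b)]


def smith_waterman(x, y, d=8):
    """Best local alignment with linear gap penalty d; returns (score, x_row, y_row)."""
    n, m = len(x), len(y)
    # Top row and left column stay 0 (first difference from global alignment).
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[""] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j], ptr[i][j] = max(
                (0, ""),                                         # start a new alignment
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "M"),  # xi aligned to yj
                (F[i - 1][j] - d, "X"),                          # xi aligned to a gap
                (F[i][j - 1] - d, "Y"),                          # yj aligned to a gap
            )
            if F[i][j] > best:           # remember the highest cell in the matrix
                best, bi, bj = F[i][j], i, j
    # Traceback starts at the highest cell (second difference) and stops at a 0.
    ax, ay = [], []
    i, j = bi, bj
    while F[i][j] > 0:
        move = ptr[i][j]
        if move == "M":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "X":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))  # (28, 'AWGHE', 'AW-HE')
```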

Example

• Note that here the local alignment is a subset of the global alignment, but that is not always the case

Global Alignment

HEAGAWGHE-E
--P-AW-HEAE

Local Alignment

AWGHE
AW-HE

Local Alignment

• When considering local alignment, the expected score for a random match must be negative

• If this is not true, then long matches between entirely unrelated sequences will have high scores, just based on their length

• Some s(a, b) must be greater than 0, otherwise the algorithm won’t find any alignment at all

Expected Score

• Assume that there is no gap and successive positions are independent

• The expected score of a fixed length alignment can be evaluated by

$$\sum_{a,b} q_a q_b \, s(a,b),$$

where qa is the probability of a symbol a at any given position in a sequence.

Expected Score

When s(a, b) is derived as a log likelihood ratio, using the same qa as for the random model probabilities, we will have

$$\sum_{a,b} q_a q_b \, s(a,b) = -\sum_{a,b} q_a q_b \log \frac{q_a q_b}{p_{ab}} < 0$$
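
The negativity of the expected score can be checked numerically. The sketch below uses a made-up two-letter model; the values of qa and pab are assumptions for illustration only, and the only requirement is that the pab sum to 1 and differ from the products qa qb.

```python
import math

# Hypothetical toy model (not real data).
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4,
     ("A", "B"): 0.1, ("B", "A"): 0.1}


def s(a, b):
    """Log-odds score s(a, b) = log(p_ab / (q_a * q_b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def expected_score():
    """Expected score per position for a random (unrelated) pair of residues."""
    return sum(q[a] * q[b] * s(a, b) for a in q for b in q)


print(expected_score())  # negative, although s("A", "A") itself is positive
```

In this model the identities score positive, so local alignments can be found, while a random pairing loses score on average, so long unrelated matches do not accumulate high scores.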

Repeated Matches

• If one or both of the sequences are long, there may exist many different local alignments with a significant score, and in most cases we would be interested in all of these

• For example, there may be many copies of a repeated domain or motif in a protein

• We want to find one or more non-overlapping copies of sections of one sequence in the other

Repeated Matches

• We are interested in matches scoring higher than T

• Let y be the sequence containing the domain or motif, and x be the sequence in which we are looking for multiple matches

The Algorithm

• The meaning and the recurrence of F(i, j) are different

• In the final alignment, x will be partitioned into regions that match parts of y in gapped alignments, and regions that are unmatched

• F(i, j) for j ≥ 1 is the best sum of match scores to x1…i , assuming that xi is in a matched region, and the corresponding match ends in xi and yj

• F(i, 0) is the best sum of completed match scores to the subsequence x1…i , i.e. assuming that xi is in an unmatched region

The Algorithm

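$$F(0,0) = 0,$$

$$F(i,0) = \max \begin{cases} F(i-1,0), \\ F(i-1,j) - T, & j = 1,\dots,m; \end{cases}$$

$$F(i,j) = \max \begin{cases} F(i,0), \\ F(i-1,j-1) + s(x_i,y_j), \\ F(i-1,j) - d, \\ F(i,j-1) - d. \end{cases}$$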

The Algorithm

• F(i, 0) handles unmatched regions and ends of matches, only allowing matches to end when they have a score of at least T

• F(i, j) handles starts of matches and extensions

• The total score of all matches is obtained by adding an extra cell to the matrix, F(n+1, 0)

• This score will have T subtracted for each match; if there were no matches of score greater than T, it will be 0

The Algorithm

• The individual match alignments can be obtained by tracking back from cell (n+1,0) to (0,0), at each point going back to the cell that was the source of the score in the current max() operation

• This traceback procedure is a global procedure, showing what each residue in x will be aligned to
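
The score part of the repeated-match algorithm can be sketched as follows. This is an illustration only: it computes the total F(n+1, 0) without the traceback, reuses the BLOSUM50 values of the HEAGAWGHEE / PAWHEAE example with d = 8, takes F(0, j) = 0 as the boundary condition, and uses T = 20 purely as an example threshold. The function name is made up here.

```python
# BLOSUM50 scores for the six residues in the example (from the table above).
RES = "AEGHPW"
B50 = {
    "A": (5, -1, 0, -2, -1, -3),
    "E": (-1, 6, -3, 0, -1, -3),
    "G": (0, -3, 8, -2, -2, -3),
    "H": (-2, 0, -2, 10, -2, -3),
    "P": (-1, -1, -2, -2, 10, -4),
    "W": (-3, -3, -3, -3, -4, 15),
}


def s(a, b):
    """Substitution score s(a, b)."""
    return B50[a][RES.index(b)]


def repeat_match_total(x, y, T, d=8):
    """Total of (match score - T) over the best set of matches of y in x."""
    n, m = len(x), len(y)
    # Rows 0..n+1; row n+1 only needs its column-0 cell. Row 0 is all zeros.
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # Unmatched state: stay unmatched, or end a match of score >= T.
        F[i][0] = max([F[i - 1][0]] + [F[i - 1][j] - T for j in range(1, m + 1)])
        if i == n + 1:
            break
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                                  # start a new match here
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # xi aligned to yj
                F[i - 1][j] - d,                          # xi aligned to a gap
                F[i][j - 1] - d,                          # yj aligned to a gap
            )
    return F[n + 1][0]


print(repeat_match_total("HEAGAWGHEE", "PAWHEAE", T=20))  # 9
```

With T = 20 the total of 9 comes from two matches in x: HEA (score 21, excess 1) and AWGHE aligned to AW-HE (score 28, excess 8).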

Example

Remark

• The algorithm obtains all the local matches in one pass

• It finds the maximal scoring set of matches, in the sense of maximizing the combined total of excess of each match score above the threshold T

• Changing the value of T changes what the algorithm finds

• Increasing T may exclude matches

Remark

• Decreasing it may split them, as well as finding new weaker ones

• A locally optimal match in the sense of the preceding section will be split into pieces if it contains an internal subalignment scoring less than -T

Overlap Matches

• We may expect that one sequence contains the other, or that they overlap

• It occurs when comparing fragments of genomic DNA sequence to each other, or to larger chromosomal sequences

• We do not penalize overhanging ends

• We want a match to start on the top or left border of the matrix, and finish on the right or bottom border

Example

The Algorithm

• F(i, 0)=0 for i=1,…,n and F(0, j)=0 for j=1,…,m

• The recurrence for F(i, j) is the same as the one in global alignment

• We set Fmax to be the maximum value on the right border (i,m), i=1,…,n, and the bottom border (n,j), j=1,…,m

• The traceback starts from the maximum point and continues until the top or left edge is reached
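
Compared with the global algorithm, only the initialisation and the choice of the end cell change. The sketch below computes the overlap score only (no traceback), again with the BLOSUM50 values of the HEAGAWGHEE / PAWHEAE example and d = 8; the function name is an assumption made here.

```python
# BLOSUM50 scores for the six residues in the example (from the table above).
RES = "AEGHPW"
B50 = {
    "A": (5, -1, 0, -2, -1, -3),
    "E": (-1, 6, -3, 0, -1, -3),
    "G": (0, -3, 8, -2, -2, -3),
    "H": (-2, 0, -2, 10, -2, -3),
    "P": (-1, -1, -2, -2, 10, -4),
    "W": (-3, -3, -3, -3, -4, 15),
}


def s(a, b):
    """Substitution score s(a, b)."""
    return B50[a][RES.index(b)]


def overlap_score(x, y, d=8):
    """Best overlap-match score and the border cell (i, j) where it ends."""
    n, m = len(x), len(y)
    # Zero borders: overhanging ends at the start are not penalized.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # Fmax is taken over the border cells (i, m) and (n, j).
    border = [(F[i][m], (i, m)) for i in range(1, n + 1)]
    border += [(F[n][j], (n, j)) for j in range(1, m + 1)]
    return max(border)


print(overlap_score("HEAGAWGHEE", "PAWHEAE"))  # (25, (10, 6))
```

The end cell (10, 6) means the match uses x up to its last residue and y up to its sixth residue, corresponding to GAWGHEE aligned to PAW-HEA.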

Example

More Complex Models

• Previously we considered only gap scores that are a simple multiple of the gap length (γ(g)=-gd)

• This type of scoring is not ideal for biological sequences: it penalizes additional gap steps as much as the first

• We should penalize additional gap steps less than the first

• When gaps occur, they are often longer than one residue

More Complex Models

• We may use a general function for γ(g)

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j), \\ F(k,j) + \gamma(i-k), & k = 0,\dots,i-1, \\ F(i,k) + \gamma(j-k), & k = 0,\dots,j-1. \end{cases}$$

• This will require intensive computation

More Complex Models

• This procedure now requires O(n³) operations to align two sequences of length n

• In each cell (i,j) we have to look at i+j+1 potential precursors, not just three as previously, so the total work is

$$\sum_{i=0}^{n} \sum_{j=0}^{m} (i+j+1) = \frac{(n+1)(m+1)(n+m+2)}{2}$$

More Complex Models

• This is a prohibitively costly increase in computational time in many cases

• Under some conditions the computational time can be brought back to O(n²), although the constant of proportionality is higher in these cases

• In each cell we then have to look at only 2k+1 potential precursors, so the total work is

$$\sum_{i=0}^{n} \sum_{j=0}^{m} (2k+1) = (n+1)(m+1)(2k+1)$$

Alignment with Affine Gap Scores

• γ(g)= -d-(g-1)e, where d is the gap-open penalty and e is the gap-extension penalty

• The alignment can still be computed in O(n²) time

• We now have to keep track of multiple values for each pair of residue indices (i,j) in place of the single value F(i,j)

• Let M(i,j) be the best score up to (i,j) given that xi is aligned to yj (left case), Ix(i,j) be the best score given that xi is aligned to a gap (in an insertion w.r.t. y, central case), and Iy(i,j) be the best score given that yj is aligned to a gap (in an insertion w.r.t. x, right case)

Assumption

• We assume that a deletion will not be followed directly by an insertion

• This will be true for the optimal path if (-d-e) is less than the lowest mismatch score

The Recurrence Relations
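
The recurrences themselves did not survive in this transcript. Given the definitions of M, Ix and Iy and the assumption that a deletion is not followed directly by an insertion, the usual three-state form would be the following (a reconstruction, so the layout may differ from the original slide):

```latex
\begin{align*}
M(i,j) &= \max \begin{cases}
  M(i-1,j-1) + s(x_i,y_j), \\
  I_x(i-1,j-1) + s(x_i,y_j), \\
  I_y(i-1,j-1) + s(x_i,y_j);
\end{cases} \\
I_x(i,j) &= \max \begin{cases}
  M(i-1,j) - d, \\
  I_x(i-1,j) - e;
\end{cases} \\
I_y(i,j) &= \max \begin{cases}
  M(i,j-1) - d, \\
  I_y(i,j-1) - e.
\end{cases}
\end{align*}
```

Opening a gap from M costs d, extending a gap that is already open costs e, and there is no transition between Ix and Iy.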

The Diagram of the Relationships

The Interpretation

• The transitions each carry a score increment, and the states each specify a ∆(i,j) pair, which is used to determine the change in indices i and j when that state is entered

• The new value for a state variable at (i,j) is the maximum of the scores corresponding to the transitions coming into the state

The Interpretation

• Each transition score is given by the value of the source state at the offsets specified by the ∆(i,j) pair of the target state, plus the specified score increment

• This type of description corresponds to a finite state automaton (FSA) in computer science

Alignment

An alignment corresponds to a path through the states, with symbols from the underlying pair of sequences being transferred to the alignment according to the ∆(i,j) values in the states.

The Algorithm

• Transitions to state M indicate letter-to-letter correspondences, so they are labeled with s(xi, yj) corresponding to the substitution score for replacing xi with yj

• We label every transition from M to a gap state (Ix or Iy) with the gap initiation penalty –d

• We label each transition from a gap state to itself with the gap extension penalty –e

Initialization

• M(0,0) = 0

• Ix(i,0) = -d - (i-1)e

• Iy(0,j) = -d - (j-1)e

• The optimal alignment score is given by max[M(n, m), Ix(n, m), Iy(n, m)]

• Every path through this model corresponds to an alignment

• If we sum every score on the transitions of a path, we will have the same score as a global alignment dynamic programming problem with affine gap penalties
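
A three-state score computation along these lines might look as follows. It is a sketch under stated assumptions: boundary gaps are scored with γ(g) = -d - (g-1)e, there is no direct transition between Ix and Iy, only the score is returned, and `simple_score` (match +2, mismatch -3) is a hypothetical substitution function used to keep the example small.

```python
NEG_INF = float("-inf")


def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3


def affine_global_score(x, y, s, d, e):
    """Best global alignment score with gap-open d and gap-extension e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # xi aligned to yj
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # xi aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # yj aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e                   # leading gap of length i
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e                   # leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = s(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + sub
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])


# One gap of length 2 (cost 4 + 1) plus three matches: 6 - 5 = 1.
print(affine_global_score("ACGGT", "ACT", simple_score, d=4, e=1))  # 1
```

Setting e equal to d turns the affine score back into the linear one; in that case the same pair of sequences scores 6 - 8 = -2.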

Example

• x3 has been matched to y5 and we are in state M

• Now x4 has to be matched and it is best to match it to a gap on y

• The current state will change from M to Ix and the penalty of d (gap open) must be paid

• If x5 is also assigned to a gap on y, the penalty of e (gap extend) must be paid

• If then x6 is assigned to y6, the state will change back to M and the cost of s(x6,y6) will be added

Example


• It is in fact frequent practice to implement an affine gap cost algorithm using only two states, M and I.


• This is only guaranteed to provide the correct result if the lowest mismatch score is >= -2e

• For those interested in pursuing the subject, the simpler state-based automata are called Moore machines, and the transition-emitting systems are called Mealy machines

More Complex FSA Models

• Four-state FSA with two match states

• There may be high fidelity regions of alignment without gaps, corresponding to match state A

More Complex FSA Models

• Separated by lower fidelity regions with gaps, corresponding to match state B and gap states Ix and Iy

• Given an alignment path, there is also an implicit attachment of labels to the symbols in the original sequences, indicating which state was used to match them

Exercise 2.10

• Calculate the score of the example alignment in Figure 2.10, with d=12, e=2

Heuristic Alignment Algorithms

• Using dynamic programming to compute similarity between two sequences will cost O(n*m)

• With this cost, aligning a sequence against a database containing millions of sequences is not feasible

• The current protein database contains of the order of 10^8 residues, so for a query sequence of length 10^3, approximately 10^11 matrix cells must be evaluated to search the complete database



• At ten million matrix cells a second, which is reasonable for a single workstation at the time this is being written, this would take 10^4 seconds, or around 3 hours

• If we want to search with many different sequences, time rapidly becomes an important issue


• The goal of faster methods is to search as small a fraction as possible of the cells in the dynamic programming matrix, while still looking at all the high scoring alignments

• For the scoring matrices used to find distant matches, exact methods become intractable, and we must use heuristic approaches that sacrifice some sensitivity


• Two of the best-known algorithms are FASTA and BLAST

• They work from the same basic idea, namely that most sequences in the database don’t match the query; because of that, the algorithms apply heuristics to exclude many of the unrelated sequences

• If these heuristics are used there will be no guarantee that we will find the optimal scoring alignment

General Approach

• Seeds: all good ungapped matches to subsequences of X of a given length (X is our query)

• For each sequence Y in the database

1. Search for seeds in Y

2. Extend the alignment around seeds or partition the alignment problem

3. If a high scoring match was found in the last step, then use dynamic programming around the good match

FASTA: Fast Alignment

• Seeds: All subsequences of length ktup

• Typical values of ktup are

- Proteins: 1-2

- DNA: 4-6

• Store all seeds with their starting position in X in a hashtable

Steps

1. Using the hashtable we can now find all exact matches (hotspots) to seeds in Y. The running time will be linear in the length of Y: O(|Y|)

2. Chain hotspots into runs of hotspots on the same diagonal. This is done efficiently by sorting hotspots on (j-i). A run consists of one or more consecutive hotspots along a common diagonal

The Steps

(2-a) Using a function that takes a set of hotspots and their distance to each other, we can find out how good the runs are. Score(run) = δ(hotspot, distance)

(2-b) Pick 10 high scoring runs:

R1 = (α1, β1) …….

R10 = (α10, β10)

(2-c) If max{S(αi, βi)} for 1 ≤ i ≤ 10 is sufficiently high, we continue

The Steps

(2-d) Now we can construct a weighted graph. If the longest path in this graph is sufficiently high we can continue to next step. The length of a path is the sum of the weights on the edges and the weights on the vertices of the path

The Steps

3. Perform banded dynamic programming around the high scoring run. The running time of this step is O( |X| + |Y| ), because the band width ‘c’ is limited by a constant
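
The seed lookup of step 1 and the diagonal grouping of step 2 can be sketched as follows. This covers only those two steps (no run scoring, graph construction or banded dynamic programming), and the function names are chosen here for illustration.

```python
from collections import defaultdict


def index_ktuples(x, ktup):
    """Hash table from each ktup-long word of the query x to its start positions."""
    table = defaultdict(list)
    for i in range(len(x) - ktup + 1):
        table[x[i:i + ktup]].append(i)
    return table


def hotspots_by_diagonal(x, y, ktup):
    """Exact ktup matches (hotspots) between x and y, grouped by diagonal j - i."""
    table = index_ktuples(x, ktup)
    diagonals = defaultdict(list)
    for j in range(len(y) - ktup + 1):          # one pass over y
        for i in table.get(y[j:j + ktup], ()):  # all query positions of this word
            diagonals[j - i].append((i, j))
    return dict(diagonals)


print(hotspots_by_diagonal("ACGTACGT", "TTACGT", ktup=4))
# {-2: [(3, 1), (4, 2)], 2: [(0, 2)]}
```

Hotspots that share a key lie on the same diagonal, so each list is a candidate run; in this toy example the two hotspots on diagonal -2 are consecutive and would be chained.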

BLAST

http://www.ncbi.nlm.nih.gov/BLAST/

Basic Local Alignment Search Tool

• The package provides programs for finding high scoring local alignment between a query sequence and a target database

• The idea is that true match alignments are very likely to contain somewhere within them a short stretch of identities, or very high scoring matches.


BLAST

• Look initially for such short stretches and use them as ‘seeds’, from which to extend out in search of a good longer alignment.

• By keeping the seed segments short, it is possible to pre-process the query sequence to make a table of all possible seeds with their corresponding start point.

BLAST

• Make a list of all ‘neighbourhood words’ of a fixed length

• Scan through the database; whenever a word in the set is found, start a ‘hit extension’ process to extend the possible match as an ungapped alignment in both directions, stopping at the maximum scoring extension

BLAST

• Finds only ungapped alignments

• Restricting to ungapped alignments misses only a small proportion of significant matches.

• Can find and report more than one high scoring match per sequence pair and can give significance values for combined scores.

BLAST

• Word: subsequence of length d

• Typical values of d are

- Protein: 3

- DNA: 11

• Seeds in BLAST are longer than seeds in FASTA, but it isn’t necessary to find exact matches

• Seeds: All words w’ such that s (w, w’) ≥ t for some word w in X

The Steps

1. Find exact matches to seeds in Y

2. Extend each match to maximal extension and report matches with score > c
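
The two ingredients, the neighbourhood word list and the ungapped hit extension, can be sketched as below. This is a simplified illustration, not how the BLAST package is implemented: the word list is built by brute force over all words of length k, the extension scans to the ends of both sequences and keeps the best-scoring prefix in each direction, and `simple_score` is a hypothetical scoring function. All names are made up here.

```python
from itertools import product


def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def neighbourhood_words(query, k, t, s, alphabet):
    """All words w' with s(w, w') >= t for some length-k word w of the query.

    Returns a dict mapping each seed word to the query positions it matches.
    """
    seeds = {}
    for pos in range(len(query) - k + 1):
        w = query[pos:pos + k]
        for cand in product(alphabet, repeat=k):
            if sum(s(a, b) for a, b in zip(w, cand)) >= t:
                seeds.setdefault("".join(cand), []).append(pos)
    return seeds


def best_extension(pairs, s):
    """Best cumulative score over all prefixes of a stream of aligned pairs."""
    best = total = 0
    for a, b in pairs:
        total += s(a, b)
        best = max(best, total)
    return best


def extend_hit(x, y, i, j, k, s):
    """Score of the maximal ungapped extension of a seed at x[i:i+k], y[j:j+k]."""
    seed = sum(s(a, b) for a, b in zip(x[i:i + k], y[j:j + k]))
    right = best_extension(zip(x[i + k:], y[j + k:]), s)
    left = best_extension(zip(reversed(x[:i]), reversed(y[:j])), s)
    return left + seed + right


seeds = neighbourhood_words("ACG", k=3, t=3, s=simple_score, alphabet="ACGT")
print(sorted(seeds))
print(extend_hit("TTACGTT", "GGACGTA", 2, 2, 3, simple_score))  # 8
```

With threshold t = 3 the seed set for ACG contains the word itself and its nine single-mismatch neighbours, which is why exact matches to the query are not required.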

Linear Space Alignments

• We want to reduce the memory needed by the alignment algorithms

• All the algorithms described so far calculate score matrices such as F(i,j), which have overall size nm

• For two protein sequences, of typical length a few hundred residues, this is well within the capacity of modern desktop computers

• If one or both of the sequences is a DNA sequence tens or hundreds of thousands of bases long, the required memory for the full matrix can exceed a machine’s physical capacity

Linear Space Alignments

• Building the full matrix takes O(n²) space, which can be too much for large n

• If we are only interested in the maximum score, there is no need to store more than the current row j and the previous row j-1

• Every value (i,j) of the current row is calculated from its predecessor (i-1,j) in the same row and from values of the previous row

• By doing so, only linear space is required, but we lose the ability to trace back and so to recover the maximum scoring alignment
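
The two-row idea can be sketched as follows for the global score with a linear gap penalty. The function name and the `simple_score` substitution function (match +2, mismatch -3) are assumptions made for this illustration.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3


def global_score_linear_space(x, y, s, d):
    """Global alignment score F(n, m), keeping only two rows in memory."""
    n, m = len(x), len(y)
    prev = [-i * d for i in range(n + 1)]      # row j = 0: F(i, 0) = -i*d
    for j in range(1, m + 1):
        cur = [-j * d]                         # F(0, j) = -j*d
        for i in range(1, n + 1):
            cur.append(max(
                prev[i - 1] + s(x[i - 1], y[j - 1]),  # from (i-1, j-1)
                prev[i] - d,                          # from (i, j-1)
                cur[i - 1] - d,                       # from (i-1, j)
            ))
        prev = cur                             # the old previous row is discarded
    return prev[n]


print(global_score_linear_space("ACGT", "AGT", simple_score, d=4))  # 2
```

Memory use is proportional to the length of x only, but since earlier rows are discarded there are no pointers left to follow for a traceback.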

Divide and Conquer Algorithm

• We repeatedly halve the problem

• The optimal alignment for the whole matrix will be the concatenation of the optimal alignments of the two submatrices

• These can again be calculated recursively, either until the length is zero or until the length is so small that the alignment can be calculated with the standard O(n²) algorithm

The Approach

• We want to divide the matrix at an entry (u,v) that lies on the optimal alignment

• We set v=⌊m/2⌋ and now we have to find the column u so that the entry (u,v) is part of the optimal alignment for the whole matrix

• For each cell (i,j) with j>v, we store not only F(i,j) but also the column where the optimal path to (i,j) crossed row v

• Thus we can obtain the column u and so the cell (u,v) once we reach the entry (n,m)

Example

• F(2,2) has been calculated and at the same time, the parent is set to (1,1)

• If the score of F(2,3) is calculated from the score of F(2,2), the parent of (2,3) will be set to the parent of (2,2) i.e. (1,1)
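
The bookkeeping for the split point can be sketched as follows. This covers a single division step only (finding (u, v) in linear space, not the full recursion), uses a linear gap penalty, and treats rows as indexed by j and columns by i as in the bullets above. If the optimal path visits several cells of row v, the sketch records the column at which it leaves that row. The function names and `simple_score` are assumptions for illustration.

```python
def simple_score(a, b):
    """Hypothetical substitution score: +2 for a match, -3 for a mismatch."""
    return 2 if a == b else -3


def crossing_point(x, y, s, d):
    """Return (F(n, m), (u, v)) with (u, v) a cell on an optimal global path."""
    n, m = len(x), len(y)
    v = m // 2
    # Row j = 0. If v == 0 each cell of this row is its own crossing column;
    # otherwise row v has not been reached yet and the column is unknown (None).
    F_prev = [-i * d for i in range(n + 1)]
    c_prev = list(range(n + 1)) if v == 0 else [None] * (n + 1)
    for j in range(1, m + 1):
        F_cur = [-j * d]
        c_cur = [0 if j >= v else None]   # column 0 is reached straight down
        for i in range(1, n + 1):
            # Each candidate carries the crossing column of its source cell.
            score, col = max(
                (F_prev[i - 1] + s(x[i - 1], y[j - 1]), c_prev[i - 1]),
                (F_prev[i] - d, c_prev[i]),
                (F_cur[i - 1] - d, c_cur[i - 1]),
                key=lambda cand: cand[0],
            )
            F_cur.append(score)
            # On row v a cell is its own crossing column; below it is inherited.
            c_cur.append(i if j == v else col)
        F_prev, c_prev = F_cur, c_cur
    return F_prev[n], (c_prev[n], v)


print(crossing_point("ACGT", "ACGT", simple_score, d=4))  # (8, (2, 2))
```

The two subproblems are then the alignment of x1…u with y1…v and of the remaining suffixes, each of which can be handled in the same way.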
