Chapter 2: Pairwise Alignment

Page 1: Chapter 2

Chapter 2

Pairwise Alignment

Page 2: Chapter 2

Pairwise Alignment

• Ask if two sequences are related

• First align the sequences (or parts of them) and then decide whether that alignment is more likely to have occurred because the sequences are related, or just by chance

Page 3: Chapter 2

Sequence Alignment

• Definition: Procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences
  – Pair-wise alignment: compare two sequences
  – Multiple sequence alignment: compare more than two sequences

Page 4: Chapter 2

Example sequence alignment

• Task: align “abcdef” with “abdgf”
• Write the second sequence below the first:

abcdef
abdgf

• Move sequences to give maximum match between them

• Show characters that match using the identical letter

Page 5: Chapter 2

Example sequence alignment

abcdef
ab
abdgf

• Insert gap between b and d on lower sequence to allow d and f to align

Page 6: Chapter 2

Example sequence alignment

abcdef
ab d f
ab-dgf

Page 7: Chapter 2

Example sequence alignment

abcdef
ab d f
ab-dgf

• Note e and g don’t match

Page 8: Chapter 2

Matching Similarity vs. Identity

• Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters

• More on how to define similarity later

Page 9: Chapter 2

Global vs. Local Alignment

• We distinguish
  – Global alignment algorithms, which optimize the overall alignment between two sequences
  – Local alignment algorithms, which seek only relatively conserved pieces of sequence
• Alignment stops at the ends of regions of strong similarity
• Favors finding conserved patterns in otherwise different pairs of sequences

Page 10: Chapter 2

Global vs. Local Alignment

• Global

LGPSSKQTGKGS-SRIWDN
L     k  GKG    R  D
LN-ITKSAGKGAIMRLGDA

• Local

--------GKG--------
        GKG
--------GKG--------

Page 11: Chapter 2

Why do sequence alignments?

• To find whether two (or more) genes or proteins are evolutionarily related to each other

• To find structurally or functionally similar regions within proteins

Page 12: Chapter 2

Key Issues

• What sorts of alignment should be considered

• The scoring system used to rank alignments

• The algorithm used to find optimal (or good) scoring alignments

• The statistical methods used to evaluate the significance of an alignment score

Page 13: Chapter 2

Example

• The following figure shows an example of three pairwise alignments, all to the same region of the human alpha globin protein sequence (SWISS-PROT database identifier HBA_HUMAN)
• Identical positions are shown with letters, and ‘similar’ positions with a plus (+) sign

Page 14: Chapter 2
Page 15: Chapter 2

Example

• In the first alignment there are many “matches”; many other positions are functionally conservative substitutions (D-E towards the end)
• The second alignment is a biologically meaningful alignment (the sequences are evolutionarily related, share the same 3D structure, and have the same function in oxygen binding), but it has many fewer identities
• The third alignment has a similar number of identities and conservative changes, yet it is a spurious alignment to a protein with a completely different structure and function

Page 16: Chapter 2

Challenges

• How to distinguish the second one from the third one?

• The determination of the scoring system is crucial

• It is difficult to distinguish true alignments from spurious alignments

Page 17: Chapter 2

The Scoring Model

• When comparing sequences, we look for evidence that they have diverged from a common ancestor by a process of mutation and selection

• Basic mutational processes
  – Substitutions: change residues in a sequence
  – Insertions and deletions: add or remove residues

• Insertions and deletions are referred to as “gaps”

• The total score assigned to an alignment will be a sum of terms for each aligned pair of residues, plus terms for each gap

Page 18: Chapter 2

The Scoring Model

• We expect identities and conservative substitutions to be more likely in alignments than we expect by chance, and so to contribute positive score terms

• Non-conservative changes are expected to be observed less frequently in real alignments than we expect by chance, and so these contribute negative score terms

Page 19: Chapter 2

Assumption

• We can consider mutations at different sites in a sequence to have occurred independently

• This is reasonable for DNA and protein sequences

• The interactions between residues also play a very critical role

• Long range dependencies should be considered for structural RNAs

Page 20: Chapter 2

Substitution Matrices

• Consider a pair of sequences, x and y, of lengths n and m

• Let xi be the ith symbol in x and yj be the jth symbol in y

• These symbols come from some alphabet A; in the case of DNA this will be the four bases {A, G, C, T}, and in the case of proteins the twenty amino acids

• We will only consider ungapped global pairwise alignments, i.e., two completely aligned equal-length sequences

Page 21: Chapter 2

Rationale

• Given a pair of aligned sequences, we want to assign a score to the alignment that gives a measure of the relative likelihood that the sequences are related, as opposed to being unrelated

• Assign a probability to the alignment in each of the two cases

• We consider the ratio of the two probabilities

Page 22: Chapter 2

Unrelated or Random Model

• Let R be the unrelated or random model
• Each letter a occurs independently with some frequency qa, and hence the probability of the two sequences is just the product of the probabilities of each amino acid:

P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}

Page 23: Chapter 2

Alternative Match Model

• Let M be the alternative match model
• Aligned pairs of residues occur with a joint probability p_{ab}
• The probability for the whole alignment is

P(x, y \mid M) = \prod_i p_{x_i y_i}

Page 24: Chapter 2

The Odds Ratio

The ratio of these two likelihoods is known as the odds ratio:

\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}

Page 25: Chapter 2

The Log Odds Ratio

We take the logarithm of the odds ratio to obtain an additive score:

S = \sum_i s(x_i, y_i), \qquad \text{where} \quad s(a, b) = \log \frac{p_{ab}}{q_a q_b}

s(a, b) is the log likelihood ratio of the residue pair (a, b) occurring as an aligned pair, as opposed to an unaligned pair

Page 26: Chapter 2

Substitution Matrices

• The s(a, b) scores can be arranged in a matrix
• For proteins, they form a 20x20 matrix (score matrix or substitution matrix)
• Using the BLOSUM50 matrix, the first alignment gets a score of 130
• PAM matrices
• Any substitution matrix is making a statement about the probability of observing ab pairs in real alignments
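A small sketch may make the log-odds construction above concrete. The frequencies below are made-up illustrative numbers for a toy two-letter alphabet, not values behind BLOSUM or PAM; real substitution matrices are estimated from large sets of trusted alignments.

```python
import math

# Hypothetical background frequencies q_a and joint pair probabilities p_ab
# for a toy two-letter alphabet (each sums to 1); illustrative values only.
q = {"A": 0.6, "G": 0.4}
p = {("A", "A"): 0.45, ("G", "G"): 0.25, ("A", "G"): 0.15, ("G", "A"): 0.15}

def log_odds_score(a, b):
    """s(a, b) = log2( p_ab / (q_a * q_b) ), the log likelihood ratio."""
    return math.log2(p[(a, b)] / (q[a] * q[b]))

def ungapped_score(x, y):
    """Score of an ungapped alignment: the sum of s(x_i, y_i) over all columns."""
    assert len(x) == len(y)
    return sum(log_odds_score(a, b) for a, b in zip(x, y))

# Likely pairs contribute positive terms, unlikely pairs negative ones.
print(ungapped_score("AAG", "AGG"))
```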

Page 27: Chapter 2

A R N D C Q E G H I L K M F P S T W Y V

A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -2 -1 -1 -3 -1 1 0 -3 -2 0

R -2 7 -1 -2 -4 1 0 -3 0 -4 -3 3 -2 -3 -3 -1 -1 -3 -1 -3

N -1 -1 7 2 -2 0 0 0 1 -3 -4 0 -2 -4 -2 1 0 -4 -2 -3

D -2 -2 2 8 -4 0 2 -1 -1 -4 -4 -1 -4 -5 -1 0 -1 -5 -3 -4

C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1

Q -1 1 0 0 -3 7 2 -2 1 -3 -2 2 0 -4 -1 0 -1 -1 -1 -3

E -1 0 0 2 -3 2 6 -3 0 -4 -3 1 -2 -3 -1 -1 -1 -3 -2 -3

G 0 -3 0 -1 -3 -2 -3 8 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4

H -2 0 1 -1 -3 1 0 -2 10 -4 -3 0 -1 -1 -2 -1 -2 -3 2 -4

I -1 -4 -3 -4 -2 -3 -4 -4 -4 5 2 -3 2 0 -3 -3 -1 -3 -1 4

L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5 -3 3 1 -4 -3 -1 -2 -1 1

K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6 -2 -4 -1 0 -1 -3 -2 -3

M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7 0 -3 -2 -1 -1 0 1

F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 -4 -3 -2 1 4 -1

P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3

S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 2 -4 -2 -2

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5 -3 -2 0

W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 2 -3

Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8 -1

V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5

BLOSUM50

Page 28: Chapter 2

A C D E F G H I K L M N P Q R S T V W Y

A 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2 Ala
C 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2 Cys
D -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3 Asp
E -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2 Glu
F -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3 Phe
G 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3 Gly
H -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2 His
I -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1 Ile
K -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2 Lys
L -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1 Leu
M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1 Met
N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2 Asn
P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3 Pro
Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1 Gln
R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2 Arg
S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2 Ser
T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2 Thr
V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1 Val
W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2 Trp
Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7 Tyr

BLOSUM62
Henikoff, S. and Henikoff, J.G. (1992) Proc. Natl. Acad. Sci. USA 89, 10915-10919.

Page 29: Chapter 2

Gap Penalties

The standard cost associated with a gap of length g is given by a linear score

\gamma(g) = -g d

or an affine score

\gamma(g) = -d - (g - 1) e

where d is called the gap-open penalty and e is called the gap-extension penalty.

Page 30: Chapter 2

Gap Penalties

• The gap-extension penalty e is usually set to something less than the gap-open penalty d, allowing long insertions and deletions to be penalized less than they would be by the linear gap cost

• This is desirable when gaps of a few residues are expected almost as frequently as gaps of a single residue
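As a minimal sketch, the linear and affine gap scores can be written directly from the formulas above (the penalty values d = 12 and e = 2 are only illustrative):

```python
def linear_gap_score(g, d=8):
    """Linear gap score: gamma(g) = -g * d."""
    return -g * d

def affine_gap_score(g, d=12, e=2):
    """Affine gap score: gamma(g) = -d - (g - 1) * e, for g >= 1."""
    return -d - (g - 1) * e

# A gap of length 5: the affine model penalises the four extension steps
# less heavily than four additional full-cost gap steps would.
print(linear_gap_score(5, d=12))       # -60
print(affine_gap_score(5, d=12, e=2))  # -20
```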

Page 31: Chapter 2

Gap Probability

The probability of a gap occurring at a particular site in a given sequence is

P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}

The qa probabilities are the same as those used in the random model. When we divide by the probability of this region according to the random model to form the odds ratio, the q_{x_i} terms cancel out, so we are left only with a term dependent on length, γ(g) = log f(g); i.e., gap penalties correspond to the log probability of a gap of that length.

Page 32: Chapter 2

Alignment Algorithms

• Given a scoring system, we need an algorithm for finding an optimal alignment for a pair of sequences
• When both sequences have the same length n and no gaps are allowed, there is only one possible global alignment of the complete sequences
• When gaps are allowed, the number of possible global alignments between two sequences of length n is

\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \simeq \frac{2^{2n}}{\sqrt{\pi n}}
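A quick numerical check of this count (a minimal sketch using exact integer arithmetic):

```python
import math

def num_global_alignments(n):
    """Number of gapped global alignments of two length-n sequences: C(2n, n)."""
    return math.comb(2 * n, n)

def approx_num_global_alignments(n):
    """Asymptotic approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)

for n in (2, 10, 100):
    print(n, num_global_alignments(n), approx_num_global_alignments(n))
```

For n = 2 this gives 6, matching the six possible alignments of “ab” and “cd” listed on the next slide.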

Page 33: Chapter 2

Example

(1) ab    (2) ab-    (3) ab-    (4) -ab
    cd        -cd        c-d        cd-

(5) ab--  (6) -ab-
    --cd      c--d

Page 34: Chapter 2

Dynamic Programming

• Guaranteed to find the optimal scoring alignment or set of alignments

• Central to computational sequence analysis

• Maximize the score to find the optimal alignment

Page 35: Chapter 2

Example

• We wish to align two short amino acid sequences: HEAGAWGHEE & PAWHEAE

• We use the BLOSUM50 score matrix and a gap cost of d = 8 per unaligned residue (i.e., each residue aligned to a gap scores -8)

Page 36: Chapter 2

H E A G A W G H E E

P -2 -1 -1 -2 -1 -4 -2 -2 -1 -1

A -2 -1 5 0 5 -3 0 -2 -1 -1

W -3 -3 -3 -3 -3 15 -3 -3 -3 -3

H 10 0 -2 -2 -2 -3 -2 10 0 0

E 0 6 -1 -3 -1 -3 -3 0 6 6

A -2 -1 5 0 5 -3 0 -2 -1 -1

E 0 6 -1 -3 -1 -3 -3 0 6 6

Page 37: Chapter 2

Global Alignment: Needleman-Wunsch Algorithm

• Construct a matrix F indexed by i, j, one index from each sequence

• F(i, j) is the score of the best alignment between the initial segment x1…i of x up to xi and the initial segment y1…j of y up to yj

• Beginning with F(0,0) = 0, we then fill the matrix from top left to bottom right

• If F(i-1, j-1), F(i-1, j), and F(i, j-1) are known, it is possible to calculate F(i, j)

Page 38: Chapter 2

Three Ways of Alignments

• xi is aligned to yj

  I G A xi
  L G V yj

• xi is aligned to a gap

  A I G A xi
  G V yj - -

• yj is aligned to a gap

  G A xi - -
  S L G V yj

Page 39: Chapter 2

Three Ways of Alignments

• xi is aligned to yj, F(i, j)= F(i-1, j-1)+s(xi, yj)

• xi is aligned to a gap, F(i, j)= F(i-1, j)-d

• yj is aligned to a gap, F(i, j)= F(i, j-1)-d

Page 40: Chapter 2

Three Ways of Alignments

F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}

Page 41: Chapter 2

The Diagram

F(i-1, j-1)       F(i, j-1)
      \                |
   s(xi, yj)          -d
        \              |
F(i-1, j)  --(-d)-->  F(i, j)

Page 42: Chapter 2

The F Matrix

• As we fill in the F(i, j) values, we also keep a pointer in each cell back to the cell from which its F(i, j) value was derived

• Along the top row, where j=0, the values F(i, j-1) and F(i-1, j-1) are not defined, so the values F(i, 0) must be handled specially

• The values F(i, 0) represent alignments of a prefix of x to all gaps in y, so we can define F(i, 0) =-id. Likewise, F(0, j)=-jd

• F(n, m) is the best score for an alignment of x1…n to y1…m
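The fill and traceback described on the last few slides can be sketched compactly as below. The match/mismatch scores are a toy stand-in for BLOSUM50, so the resulting score and alignment will differ from the worked example in the text; only the structure of the algorithm is the point here.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment with a linear gap cost of d per unaligned residue.

    F[i][j] is the best score of aligning x[:i] with y[:j]; ptr records which
    of the three cases produced each cell so the alignment can be traced back
    from (n, m) to (0, 0).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # boundary: prefix of x against gaps
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):                 # boundary: prefix of y against gaps
        F[0][j], ptr[0][j] = -j * d, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),  # xi aligned to yj
                (F[i - 1][j] - d, "up"),                            # xi aligned to a gap
                (F[i][j - 1] - d, "left"),                          # yj aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
    ax, ay, i, j = [], [], n, m               # traceback from the bottom-right corner
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

# Toy score: +5 for a match, -3 for a mismatch (BLOSUM50 would be used in practice).
score, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE",
                                      lambda a, b: 5 if a == b else -3, d=8)
print(score)
print(top)
print(bottom)
```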

Page 43: Chapter 2

The Global Dynamic Programming Matrix

Page 44: Chapter 2

Local Alignment: Smith-Waterman Algorithm

• In the previous section, we know which sequences we want to align, and we are looking for the best match between them from one end to the other
• Most often, we are looking for the best alignment between subsequences of x and y
• This occurs when it is suspected that two protein sequences may share a common domain, or when comparing extended sections of genomic DNA sequence

Page 45: Chapter 2

Local Alignment

• It is the most sensitive way to detect similarity when comparing two very highly diverged sequences, even when they share a common evolutionary origin

• In this case, only part of the sequence has been under strong enough selection to preserve detectable similarity; the rest will have accumulated so much noise through mutation that it is no longer alignable

• The highest scoring alignment of subsequences of x and y is called the best local alignment

Page 46: Chapter 2

The Algorithm

• The algorithm is closely related to that for global alignments

• However, there are two differences:
  – In each cell in the table, F(i, j) is allowed to take the value 0 if all other options have values less than 0
  – An alignment can end anywhere in the matrix

Page 47: Chapter 2

The Algorithm

F(i, j) = \max \begin{cases} 0 \\ F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}

Page 48: Chapter 2

The First Difference

• Taking the option 0 corresponds to starting a new alignment

• If the best alignment up to some point has a negative score, it is better to start a new one, rather than extend the old one

• The top row and left column will be filled with 0s, not –id and –jd as for global alignment

Page 49: Chapter 2

The Second Difference

• We look for the highest value of F(i, j) over the whole matrix, and start the traceback from there

• The traceback ends when we meet a cell with value 0, which corresponds to the start of the alignment
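A sketch of the local (Smith-Waterman) variant, differing from the global sketch above only in the extra 0 option, the tracking of the best cell, and a traceback that stops at a 0 cell (again with toy match/mismatch scoring rather than BLOSUM50):

```python
def smith_waterman(x, y, s, d):
    """Local alignment with a linear gap cost of d per unaligned residue.

    Differs from the global fill in two ways: each cell may take the value 0
    (start a new alignment), and traceback starts from the highest-scoring
    cell and stops at the first cell whose value is 0.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (0, None),                                          # start a new alignment
                (F[i - 1][j - 1] + s(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    ax, ay = [], []
    i, j = best_cell
    while F[i][j] > 0:                                              # traceback ends at a 0 cell
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

# Same toy match/mismatch score as the global sketch above.
print(smith_waterman("HEAGAWGHEE", "PAWHEAE",
                     lambda a, b: 5 if a == b else -3, d=8))
```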

Page 50: Chapter 2

Example

Page 51: Chapter 2

Example

• Note that the local alignment is a subset of the global alignment, but that is not always the case

Global Alignment

HEAGAWGHE-E
--P-AW-HEAE

Local Alignment

AWGHE
AW-HE

Page 52: Chapter 2

Local Alignment

• When considering local alignment, the expected score for a random match must be negative

• If this is not true, then long matches between entirely unrelated sequences will have high scores, just based on their length

• Some s(a, b) must be greater than 0, otherwise the algorithm won’t find any alignment at all

Page 53: Chapter 2

Expected Score

• Assume that there is no gap and successive positions are independent

• The expected score of a fixed-length alignment can be evaluated by

\sum_{a,b} q_a q_b \, s(a, b)

where qa is the probability of a symbol a at any given position in a sequence.
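A minimal numerical check of this requirement, reusing the toy frequencies from the earlier log-odds sketch (illustrative values, not real amino acid statistics):

```python
import math

# The same illustrative toy frequencies as in the earlier log-odds sketch.
q = {"A": 0.6, "G": 0.4}
p = {("A", "A"): 0.45, ("G", "G"): 0.25, ("A", "G"): 0.15, ("G", "A"): 0.15}

def s(a, b):
    return math.log2(p[(a, b)] / (q[a] * q[b]))

# Expected per-position score of aligning two random, unrelated sequences.
expected = sum(q[a] * q[b] * s(a, b) for a in q for b in q)
print(expected)  # negative, as required for local alignment to behave sensibly
```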

Page 54: Chapter 2

Expected Score

When s(a, b) is derived as a log likelihood ratio, using the same qa as for the random model probabilities, we will have

\sum_{a,b} q_a q_b \, s(a, b) = \sum_{a,b} q_a q_b \log \frac{p_{ab}}{q_a q_b} \le 0

Page 55: Chapter 2

Repeated Matches

• If one or both of the sequences are long, there may exist many different local alignments with a significant score, and in most cases we would be interested in all of these

• For example, there may be many copies of a repeated domain or motif in a protein

• We want to find one or more non-overlapping copies of sections of one sequence in the other

Page 56: Chapter 2

Repeated Matches

• We are interested in matches scoring higher than T

• Let y be the sequence containing the domain or motif, and x be the sequence in which we are looking for multiple matches

Page 57: Chapter 2

The Algorithm

• The meaning and the recurrence of F(i, j) are different

• In the final alignment, x will be partitioned into regions that match parts of y in gapped alignments, and regions that are unmatched
• F(i, j) for j ≥ 1 is the best sum of match scores to x1…i, assuming that xi is in a matched region and the corresponding match ends in xi and yj

• F(i, 0) is the best sum of completed match scores to the subsequence x1…i , i.e. assuming that xi is in an unmatched region

Page 58: Chapter 2

The Algorithm

F(0, 0) = 0

F(i, 0) = \max \begin{cases} F(i-1, 0) \\ F(i-1, j) - T, & j = 1, \dots, m \end{cases}

F(i, j) = \max \begin{cases} F(i, 0) \\ F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}

Page 59: Chapter 2

The Algorithm

• F(i, 0) handles unmatched regions and ends of matches, only allowing matches to end when they score at least T
• F(i, j) handles starts of matches and extensions
• The total score of all matches is obtained by adding an extra cell to the matrix, F(n+1, 0)
• This score will have T subtracted for each match; if there were no matches of score greater than T, it will be 0

Page 60: Chapter 2

The Algorithm

• The individual match alignments can be obtained by tracking back from cell (n+1,0) to (0,0), at each point going back to the cell that was the source of the score in the current max() operation

• This traceback procedure is a global procedure, showing what each residue in x will be aligned to
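A sketch of the fill for this repeated-match recurrence follows. It uses a toy scoring function and reports only the total score F(n+1, 0); the traceback described above is omitted for brevity.

```python
def repeated_matches_score(x, y, s, d, T):
    """Total score of all non-overlapping matches of parts of y within x.

    Implements F(i, 0) = max(F(i-1, 0), max_j F(i-1, j) - T) and
    F(i, j) = max(F(i, 0), F(i-1, j-1) + s(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d);
    the total score is read from the extra cell F(n+1, 0).
    """
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 2)]        # rows 0..n+1, row 0 all zeros
    for i in range(1, n + 2):
        # Unmatched region or end of a match; only matches scoring more than T
        # increase the total.
        F[i][0] = max([F[i - 1][0]] + [F[i - 1][j] - T for j in range(1, m + 1)])
        if i == n + 1:
            break                                      # F(n+1, 0) holds the total score
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],                               # start of a new match
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    return F[n + 1][0]

# y carries the motif we look for repeatedly in x (toy match/mismatch scores).
print(repeated_matches_score("HEAGAWGHEE", "GAW",
                             lambda a, b: 5 if a == b else -3, d=8, T=5))
```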

Page 61: Chapter 2

Example

Page 62: Chapter 2

Remark

• The algorithm obtains all the local matches in one pass

• It finds the maximal scoring set of matches, in the sense of maximizing the combined total of excess of each match score above the threshold T

• Changing the value of T changes what the algorithm finds

• Increasing T may exclude matches

Page 63: Chapter 2

Remark

• Decreasing it may split them, as well as finding new weaker ones

• A locally optimal match in the sense of the preceding section will be split into pieces if it contains internal subalignment scoring less than -T

Page 64: Chapter 2

Overlap Matches

• We may expect that one sequence contains the other, or that they overlap

• It occurs when comparing fragments of genomic DNA sequence to each other, or to larger chromosomal sequences

• We do not penalize overhanging ends
• We want a match to start on the top or left border of the matrix, and finish on the right or bottom border

Page 65: Chapter 2

Example

Page 66: Chapter 2

The Algorithm

• F(i, 0) = 0 for i = 1, …, n and F(0, j) = 0 for j = 1, …, m
• The recurrence for F(i, j) is the same as in global alignment

• We set Fmax to be the maximum value on the right border (i,m), i=1,…,n, and the bottom border (n,j), j=1,…,m

• The traceback starts from the maximum point and continues until the top or left edge is reached

Page 67: Chapter 2

Example

Page 68: Chapter 2

More Complex Models

• Previously we only considered the gap score to be a simple multiple of the length (γ(g) = -gd)
• This type of scoring is not ideal for biological sequences: it penalizes additional gap steps as much as the first
• We should penalize additional gap steps less than the first

• When gaps occur, they are often longer than one residue

Page 69: Chapter 2

More Complex Models

• We may use a general function for γ(g)

• This will require intensive computation

F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(k, j) + \gamma(i - k), & k = 0, \dots, i-1 \\ F(i, k) + \gamma(j - k), & k = 0, \dots, j-1 \end{cases}

Page 70: Chapter 2

More Complex Models

• This procedure now requires O(n^3) operations to align two sequences of length n
• In each cell (i, j) we have to look at i + j + 1 potential precursors, not just three as previously:

\sum_{i=0}^{n} \sum_{j=0}^{m} (i + j + 1) = (n+1)(m+1)\left(1 + \frac{n + m}{2}\right) \approx \frac{1}{2}\, nm(n + m)

Page 71: Chapter 2

More Complex Models

• Prohibitively costly increase in computational time in many cases
• Under some conditions the computational time can be reduced to O(n^2), although the constant of proportionality is higher in these cases
• In each cell we then have to look at 2k + 1 potential precursors:

\sum_{i=0}^{n} \sum_{j=0}^{m} (2k + 1) = (2k + 1)(n+1)(m+1) \approx (2k + 1)\, nm

Page 72: Chapter 2

Alignment with Affine Gap Scores

• γ(g) = -d - (g-1)e, where d is the gap-open penalty and e is the gap-extension penalty
• The algorithm still runs in O(n^2) time
• We now have to keep track of multiple values for each pair of residue positions (i, j) in place of the single value F(i, j)

Page 73: Chapter 2

Let M(i, j) be the best score up to (i, j) given that xi is aligned to yj (left case), Ix(i, j) be the best score given that xi is aligned to a gap (in an insertion with respect to y, central case), and Iy(i, j) be the best score given that yj is aligned to a gap (in an insertion with respect to x, right case).

Page 74: Chapter 2

Assumption

• We assume that a deletion will not be followed directly by an insertion

• This will be true for the optimal path if (-d - e) is less than the lowest mismatch score

Page 75: Chapter 2

The Recurrence Relations

M(i, j) = \max \begin{cases} M(i-1, j-1) + s(x_i, y_j) \\ I_x(i-1, j-1) + s(x_i, y_j) \\ I_y(i-1, j-1) + s(x_i, y_j) \end{cases}

I_x(i, j) = \max \begin{cases} M(i-1, j) - d \\ I_x(i-1, j) - e \end{cases}

I_y(i, j) = \max \begin{cases} M(i, j-1) - d \\ I_y(i, j-1) - e \end{cases}

Page 76: Chapter 2

The Diagram of the Relationships

Page 77: Chapter 2

The Interpretation

• The transitions each carry a score increment, and the states each specify a ∆(i, j) pair, which is used to determine the change in indices i and j when that state is entered

• The new value for a state variable at (i,j) is the maximum of the scores corresponding to the transitions coming into the state

Page 78: Chapter 2

The Interpretation

• Each transition score is given by the value of the source state at the offsets specified by the ∆(i,j) pair of the target state, plus the specified score increment

• This type of description corresponds to a finite state automaton (FSA) in computer science

Page 79: Chapter 2

Alignment

An alignment corresponds to a path through the states, with symbols from the underlying pair of sequences being transferred to the alignment according to the ∆(i, j) values in the states.

Page 80: Chapter 2

The Algorithm

• Transitions to state M indicate letter-to-letter correspondences, so they are labeled with s(xi, yj) corresponding to the substitution score for replacing xi with yj

• We label every transition from M to a gap state (Ix or Iy) with the gap initiation penalty –d

• We label each transition from a gap state to itself with the gap extension penalty –e

Page 81: Chapter 2

Initialization

• M(0, 0) = 0
• Ix(i, 0) = -d - (i - 1)e
• Iy(0, j) = -d - (j - 1)e
• The optimal alignment is given by max[M(n, m), Ix(n, m), Iy(n, m)]
• Every path through this model corresponds to an alignment
• If we sum every score on the transitions of a path, we will have the same score as a global alignment dynamic programming problem with affine gap penalties
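A compact sketch of the three-state fill with affine gap scores, returning only the optimal score (traceback omitted). The toy match/mismatch scores and the boundary values follow the γ(g) = -d - (g-1)e convention used above; because Ix never looks at Iy (and vice versa), the sketch also encodes the assumption that a deletion is not directly followed by an insertion.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, s, d, e):
    """Global alignment score with affine gap penalty gamma(g) = -d - (g-1)*e.

    M[i][j]  : best score with x_i aligned to y_j
    Ix[i][j] : best score with x_i aligned to a gap (insertion with respect to y)
    Iy[i][j] : best score with y_j aligned to a gap (insertion with respect to x)
    """
    n, m = len(x), len(y)
    M  = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e          # prefix of x aligned to gaps
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e          # prefix of y aligned to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = s(x[i - 1], y[j - 1])
            M[i][j] = max(M[i-1][j-1], Ix[i-1][j-1], Iy[i-1][j-1]) + sub
            Ix[i][j] = max(M[i-1][j] - d, Ix[i-1][j] - e)   # open or extend a gap in y
            Iy[i][j] = max(M[i][j-1] - d, Iy[i][j-1] - e)   # open or extend a gap in x
    return max(M[n][m], Ix[n][m], Iy[n][m])

print(affine_global_score("HEAGAWGHEE", "PAWHEAE",
                          lambda a, b: 5 if a == b else -3, d=12, e=2))
```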

Page 82: Chapter 2

Example

• x3 has been matched to y5 and we are in state M

• Now x4 has to be matched and it is best to match it to a gap on y

• The current state will change from M to Ix and the penalty of d (gap open) must be paid

• If x5 is also assigned to a gap on y, the penalty of e (gap extend) must be paid

• If then x6 is assigned to y6, the state will change back to M and the cost of s(x6,y6) will be added

Page 83: Chapter 2

Example

Page 84: Chapter 2

Alignment with Affine Gap Scores

• It is in fact frequent practice to implement an affine gap cost algorithm using only two states, M and I.

Page 85: Chapter 2

Alignment with Affine Gap Scores

• This is only guaranteed to provide the correct result if the lowest mismatch score is >= -2e

• For those interested in pursuing the subject, the simpler state-based automata are called Moore machines, and the transition-emitting systems are called Mealy machines

Page 86: Chapter 2

More Complex FSA Models

• Four-state FSA with two match states

• There may be high fidelity regions of alignment without gaps, corresponding to match state A

Page 87: Chapter 2

More Complex FSA Models

• Separated by lower fidelity regions with gaps, corresponding to match state B and gap states Ix and Iy

• Given an alignment path, there is also an implicit attachment of labels to the symbols in the original sequences, indicating which state was used to match them

Page 88: Chapter 2

Exercise 2.10

• Calculate the score of the example alignment in Figure 2.10, with d=12, e=2

Page 89: Chapter 2

Heuristic Alignment Algorithms

• Using dynamic programming to compute similarity between two sequences will cost O(n*m)

• With this cost, aligning a sequence against a database containing millions of sequences is not feasible

• The current protein database contains of the order of 10^8 residues, so for a sequence of length 10^3, approximately 10^11 matrix cells must be evaluated to search the complete database

Page 90: Chapter 2

Heuristic Alignment Algorithms


• At ten million matrix cells a second, which is reasonable for a single workstation at the time this is being written, this would take 10^4 seconds, or around 3 hours

• If we want to search with many different sequences, time rapidly becomes an important issue

Page 91: Chapter 2

Heuristic Alignment Algorithms

• The goal of these methods is to search as small a fraction as possible of the cells in the dynamic programming matrix, while still looking at all the high scoring alignments
• It is for the scoring matrices used to find distant matches that exact methods become intractable, and we must use heuristic approaches that sacrifice some sensitivity

Page 92: Chapter 2

Heuristic Alignment Algorithms

• Two of the best-known algorithms are FASTA and BLAST

• They work from the same basic idea, namely that most sequences in the database don't match the query, and because of that the algorithms apply some heuristic to exclude many of the unrelated sequences

• If these heuristics are used there will be no guarantee that we will find the optimal scoring alignment

Page 93: Chapter 2

General Approach

• Seeds: all good ungapped matches to subsequences of X of a given length (X is our query)
• For each sequence Y in the database:
  1. Search for seeds in Y
  2. Extend the alignment around seeds, or partition the alignment problem
  3. If a high scoring match was found in the last step, then use dynamic programming around the good match

Page 94: Chapter 2

FASTA (Fast Alignment)

• Seeds: All subsequences of length ktup

• Typical values of ktup are

- Proteins: 1-2

- DNA: 4-6

• Store all seeds with their starting position in X in a hashtable
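A minimal sketch of the seed table and the hotspot lookup it supports (the sequences and ktup value are illustrative; real FASTA layers run scoring, diagonal chaining, and banded dynamic programming on top of this):

```python
from collections import defaultdict

def build_seed_table(x, ktup):
    """Map every length-ktup word of the query X to its start positions in X."""
    table = defaultdict(list)
    for i in range(len(x) - ktup + 1):
        table[x[i:i + ktup]].append(i)
    return table

def find_hotspots(table, y, ktup):
    """Exact matches (hotspots) of query words in Y, as (i, j) start pairs.

    Each word lookup is a hash access, so scanning Y costs roughly O(|Y|)
    plus the number of hotspots found. Hotspots with the same offset j - i
    lie on the same diagonal and can be chained into runs.
    """
    hotspots = []
    for j in range(len(y) - ktup + 1):
        for i in table.get(y[j:j + ktup], []):
            hotspots.append((i, j))
    return sorted(hotspots, key=lambda ij: (ij[1] - ij[0], ij[0]))  # group by diagonal

x = "HEAGAWGHEE"              # query X
y = "PAWHEAE"                 # one database sequence Y
print(find_hotspots(build_seed_table(x, ktup=2), y, ktup=2))
```

On this toy input the hotspots (0, 3) and (1, 4) share the diagonal j - i = 3 and would be chained into a single run in step 2 below.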

Page 95: Chapter 2

Steps

1. Using the hashtable we can now find all exact matches (hotspots) to seeds in Y. The running time will be linear in the length of Y: O(|Y|)

2. Chain hotspots into runs of hotspots on the same diagonal. This is done efficiently by sorting hotspots on (j - i). A run consists of one or more consecutive hotspots along a common diagonal

Page 96: Chapter 2
Page 97: Chapter 2

The Steps

(2-a) Using a function that takes a set of hotspots and their distance to each other, we can find out how good the runs are. Score(run) = δ(hotspot, distance)

(2-b) Pick 10 high scoring runs:

R1 = (α1, β1) …….

R10 = (α10, β10)

(2-c) If max{S(αi, βi)}, 1 ≤ i ≤ 10, is sufficiently high, we continue

Page 98: Chapter 2

The Steps

(2-d) Now we can construct a weighted graph. If the longest path in this graph is sufficiently high we can continue to next step. The length of a path is the sum of the weights on the edges and the weights on the vertices of the path

Page 99: Chapter 2
Page 100: Chapter 2

The Steps

3. Perform banded dynamic programming around the high scoring run. The running time of this step is O(|X| + |Y|), because the band width c is bounded by a constant

Page 101: Chapter 2
Page 102: Chapter 2

BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)

Basic Local Alignment Search Tool

• The package provides programs for finding high scoring local alignment between a query sequence and a target database

• The idea is that true match alignments are very likely to contain somewhere within them a short stretch of identities, or very high scoring matches

Page 103: Chapter 2
Page 104: Chapter 2

BLAST

• Look initially for such short stretches and use them as ‘seeds’, from which to extend out in search of a good longer alignment.

• By keeping the seed segments short, it is possible to pre-process the query sequence to make a table of all possible seeds with their corresponding start point.

Page 105: Chapter 2

BLAST

• Make a list of all ‘neighbourhood words’ of a fixed length

• Scan through the database; whenever a word in the set is found, start a ‘hit extension’ process to extend the possible match as an ungapped alignment in both directions, stopping at the maximum scoring extension

Page 106: Chapter 2

BLAST

• Finds only ungapped alignments
• Restricting to ungapped alignments misses only a small proportion of significant matches

• Can find and report more than one high scoring match per sequence pair and can give significance values for combined scores.

Page 107: Chapter 2

BLAST

• Word: subsequence of length d
• Typical values of d are
  - Proteins: 3
  - DNA: 11
• Seeds in BLAST are longer than seeds in FASTA, but it isn't necessary to find exact matches
• Seeds: all words w' such that s(w, w') ≥ t for some word w in X
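A sketch of generating the neighbourhood-word seeds for a single query word. The alphabet and scoring function are tiny stand-ins (real BLAST uses the 20-letter amino acid alphabet, BLOSUM62, and word length 3):

```python
from itertools import product

ALPHABET = "AGHW"            # toy alphabet; the real protein alphabet has 20 letters

def s(a, b):
    """Toy substitution score (+5 identity, -3 otherwise) standing in for BLOSUM62."""
    return 5 if a == b else -3

def neighbourhood_words(w, t):
    """All words w' of the same length as w with total score s(w, w') >= t."""
    hits = []
    for cand in product(ALPHABET, repeat=len(w)):
        cand = "".join(cand)
        if sum(s(a, b) for a, b in zip(w, cand)) >= t:
            hits.append(cand)
    return hits

# Words scoring at least 7 against "GAW": the word itself plus single substitutions.
print(neighbourhood_words("GAW", t=7))
```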

Page 108: Chapter 2

The Steps

1. Find exact matches to seeds in Y

2. Extend each match to maximal extension and report matches with score > c

Page 109: Chapter 2

Linear Space Alignments

• We want to reduce memory usage, which is also a limited computational resource

• All the algorithms described so far calculate score matrices such as F(i,j), which have overall size nm

• For two protein sequences, of typical length a few hundred residues, this is well within the capacity of modern desktop computers

• If one or both of the sequences is a DNA sequence tens or hundreds of thousands of bases long, the required memory for the full matrix can exceed a machine's physical capacity

Page 110: Chapter 2

Linear Space Alignments

• Building the matrix takes O(n^2) space, which can be too much for larger n

• If we are only interested in the max score, there is no need for storing more than the current row j and the last row j-1

• Every value (i,j) of the current row is calculated from the predecessor (i-1,j) and from values of the last row

• By doing so, only linear space is required, but we also lose the possibility to trace back and so to recover the maximal scoring alignment
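A minimal sketch of this score-only, two-row version of the global fill (same toy scoring and linear gap penalty as in the earlier global alignment sketch); it returns only F(n, m), since the pointers needed for traceback are not kept:

```python
def global_score_linear_space(x, y, s, d):
    """Global alignment score keeping only two rows of the matrix (O(m) memory)."""
    n, m = len(x), len(y)
    prev = [-j * d for j in range(m + 1)]       # row i = 0: prefix of y against gaps
    for i in range(1, n + 1):
        cur = [-i * d] + [0] * m                # column j = 0: prefix of x against gaps
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + s(x[i - 1], y[j - 1]),   # xi aligned to yj
                         prev[j] - d,                           # xi aligned to a gap
                         cur[j - 1] - d)                        # yj aligned to a gap
        prev = cur
    return prev[m]

# Gives the same score as the full Needleman-Wunsch sketch, without the traceback.
print(global_score_linear_space("HEAGAWGHEE", "PAWHEAE",
                                lambda a, b: 5 if a == b else -3, d=8))
```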

Page 111: Chapter 2

Divide and Conquer Algorithm

• We halve the problem all the time

• The optimal alignment for the whole matrix will be the concatenation of the optimal alignments of the two submatrices

• These can again be calculated recursively, either until the length is zero or until the length is so small that it can be calculated with the standard O(n^2) algorithm

Page 112: Chapter 2

The Approach

• We want to divide the matrix at the entry (u, v)
• We set v = ⌊m/2⌋, and now we have to find the column u so that the entry (u, v) is part of the optimal alignment for the whole matrix
• For each cell (i, j) with j > v, we store not only F(i, j) but also the column where the optimal path to (i, j) crossed row v
• Thus we can obtain the column u, and so the cell (u, v), once we reach the entry (n, m)

Page 113: Chapter 2
Page 114: Chapter 2

Example

• F(2,2) has been calculated and at the same time, the parent is set to (1,1)

• If the score of F(2,3) is calculated from the score of F(2,2), the parent of (2,3) will be set to the parent of (2,2) i.e. (1,1)