Pairwise Sequence Comparison

Stat 246, Spring 2002, Week 5

Sequence comparison: topics

General concepts

Dot plots

Global alignments

Scoring matrices

Gap penalties

Dynamic programming

Chance or common ancestry?

Dot Plot

This is the earliest, simplest and most complete method for comparing two sequences

It is possible to filter the plot to minimise noise whilst preserving the obvious relationship

This plot can identify

• regions of similarity

• internal repeats

• rearrangement events

[Figure: dot-plot grid with sequence 1, a = AGCACACA, down the side and sequence 2, b = ACACACTA, along the top.]

A dot goes where the two sequences match.

(Add a “guard” row and column.)

Connect the dots along diagonals.
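The basic dot plot described above can be sketched in a few lines of Python. This is a minimal illustration, not the program used to produce the slides; the function names `dot_plot` and `render` are made up for this example, and the guard row and column are omitted.

```python
def dot_plot(seq_down, seq_along):
    """Return a 0/1 grid: rows follow seq_down, columns follow seq_along.

    An entry is 1 (a dot) exactly where the two letters match.
    """
    return [[1 if x == y else 0 for y in seq_along] for x in seq_down]


def render(seq_down, seq_along, dots):
    """Draw the grid as text, with '*' for a dot and '.' for no dot."""
    lines = ["  " + " ".join(seq_along)]
    for letter, row in zip(seq_down, dots):
        lines.append(letter + " " + " ".join("*" if d else "." for d in row))
    return "\n".join(lines)


a = "AGCACACA"  # sequence 1, down the side
b = "ACACACTA"  # sequence 2, along the top
dots = dot_plot(a, b)
print(render(a, b, dots))
```

Runs of `*` along a diagonal correspond to stretches where the two sequences agree without gaps.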

Extensions to dot plots

Modern dot plots are more sophisticated, using the notions of

window: size of the diagonal strip centered on an entry, over which matching is accumulated, and

stringency: the extent of agreement required over the window, before a dot is placed at the central entry.

e.g. for a window of size 5, we might require at least 3 matches, and then we put a dot in the central spot. More complex scoring rules can be used.
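A possible sketch of the window and stringency filter follows. It assumes an odd window size, counts plain identities within the window, and treats window positions that fall off the edge of either sequence as non-matches; the name `windowed_dot_plot` and these edge rules are assumptions for illustration, not taken from any particular dot-plot program.

```python
def windowed_dot_plot(seq_down, seq_along, window=5, stringency=3):
    """Place a dot at (i, j) if at least `stringency` of the `window`
    diagonal positions centered on (i, j) are matches.

    Assumption: `window` is odd, and positions outside either sequence
    simply contribute no match.
    """
    half = window // 2
    n, m = len(seq_down), len(seq_along)
    dots = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            matches = 0
            for k in range(-half, half + 1):
                ii, jj = i + k, j + k  # walk along the diagonal through (i, j)
                if 0 <= ii < n and 0 <= jj < m and seq_down[ii] == seq_along[jj]:
                    matches += 1
            if matches >= stringency:
                dots[i][j] = 1
    return dots


a = "AGCACACA"
b = "ACACACTA"
filtered = windowed_dot_plot(a, b, window=5, stringency=3)
```

With `window=1, stringency=1` this reduces to the plain dot plot; larger windows suppress isolated matches that are not part of a longer diagonal run.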

Human globin vs. human myoglobin

[Figure: dot plot; one axis is beta-human.pep, residues 1 to 146.]

Human LDL receptor vs. itself (w=30, s=9)

[Figure: dot plot of ldlrecep.pep, residues 1 to 860, against itself.]

Human LDL receptor vs. itself (40, 15)

[Figure: dot plot of ldlrecep.pep, residues 1 to 860, against itself. COMPARE Window: 40 Stringency: 15.0 Points: 5,287]

Human LDL receptor vs. itself (40, 17.5)

[Figure: dot plot of the same sequence against itself, axes 0 to 800.]

Human LDL receptor vs. itself (40, 20)

[Figure: dot plot of the same sequence against itself, axes 0 to 800.]

Plasmodium falciparum MSP3 vs. itself (30,9)

[Figure: dot plot of msp3.pep, residues 1 to 380, against itself.]

Global alignment

An alignment of two sequences a and b is an arrangement of a and b by position, where a and b can be padded with gap symbols to achieve the same length:

a: AGCACAC-A    or    AG-CACACA
b: A-CACACTA          ACACACT-A

If we read the alignment column-wise, we have a protocol of edit operations that lead from a to b.

Left:               Right:
Match   (A,A)       Match   (A,A)
Delete  (G,-)       Replace (G,C)
Match   (C,C)       Insert  (-,A)
Match   (A,A)       Match   (C,C)
Match   (C,C)       Match   (A,A)
Match   (A,A)       Match   (C,C)
Match   (C,C)       Replace (A,T)
Insert  (-,T)       Delete  (C,-)
Match   (A,A)       Match   (A,A)

The left-hand alignment shows one Delete, one Insert, and the other edit operations are Matches.

The right-hand alignment shows one Insert, one Delete, two Replaces, and the remaining operations are Matches.
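Reading an alignment column by column can be written out as a short Python sketch. The function name `edit_protocol` is invented for this illustration; it assumes the two padded rows have equal length and use `-` as the gap symbol.

```python
def edit_protocol(row_a, row_b):
    """Turn two gapped rows of equal length into a list of edit operations.

    Each column (u, v) becomes Match, Replace, Delete (gap in b) or
    Insert (gap in a), describing how to get from a to b.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    ops = []
    for u, v in zip(row_a, row_b):
        if u == "-" and v == "-":
            raise ValueError("a column cannot contain two gap symbols")
        if u == "-":
            ops.append(("Insert", u, v))
        elif v == "-":
            ops.append(("Delete", u, v))
        elif u == v:
            ops.append(("Match", u, v))
        else:
            ops.append(("Replace", u, v))
    return ops


left = edit_protocol("AGCACAC-A", "A-CACACTA")
right = edit_protocol("AG-CACACA", "ACACACT-A")
for name, u, v in left:
    print(f"{name:8s}({u},{v})")
```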

Cost (scoring) of global alignments; optimal global alignments

Next we turn the edit protocol into a measure of distance by assigning a “cost” or “weight” S to each operation. For example, for arbitrary characters u,v from A we may define

S(u,u) = 0; S(u,v) = 1 for u ≠ v; S(u,-) = S(-,v) = 1. (Unit Cost)

This scheme is known as the Levenshtein distance, also called the unit cost model. Its predominant virtue is its simplicity. In general, more sophisticated cost models must be used. For example, replacing an amino acid by a biochemically similar one should weigh less than a replacement by an amino acid with totally different properties. Details shortly. Now we are ready to define the most important notion for sequence analysis:

The cost of an alignment of two sequences a and b is the sum of the costs of all the edit operations that lead from a to b.

An optimal alignment of a and b is an alignment which has minimal cost among all possible alignments.

The edit distance of a and b is the cost of an optimal alignment of a and b under a cost function S. We denote it by d(a,b).

Using the unit cost model for S in our previous example, we obtain the following cost:

a: AGCACAC-A    or    AG-CACACA
b: A-CACACTA          ACACACT-A
   cost: 2            cost: 4

Here it is easily seen that the left-hand alignment is optimal under the unit cost model, and hence the edit distance d(a,b) = 2.
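The cost of a given alignment under the unit cost model can be checked with a small Python sketch. The names `unit_cost` and `alignment_cost` are hypothetical; any other cost function S(u, v) could be passed in place of `unit_cost`.

```python
def unit_cost(u, v):
    """Unit cost model: 0 for a match, 1 for a replacement, insert or delete."""
    return 0 if u == v else 1


def alignment_cost(row_a, row_b, cost=unit_cost):
    """Sum the cost of every column of a gapped alignment.

    Assumes the rows have equal length and no column holds two gap symbols.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(cost(u, v) for u, v in zip(row_a, row_b))


left_cost = alignment_cost("AGCACAC-A", "A-CACACTA")
right_cost = alignment_cost("AG-CACACA", "ACACACT-A")
print(left_cost, right_cost)
```

This only scores an alignment that is already given; finding the alignment of minimal cost is the job of dynamic programming.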

More general scores = - costs: see later.

C 9

S -1 4

T -1 1 5

P -3 -1 -1 7

A 0 1 0 -1 4

G -3 0 -2 -2 0 6

N -3 1 0 -2 -2 0 6

D -3 0 -1 -1 -2 -1 1 6

E -4 0 -1 -1 -1 -2 0 2 5

Q -3 0 -1 -1 -1 -2 0 0 2 5

H -3 -1 -2 -2 -2 -2 1 -1 0 0 8

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5

K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5

I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4

L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4

V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7

W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11

C S T P A G N D E Q H R K M I L V F Y W

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
    |     |||         |       |        ||||||    |    || ||
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

D:D = +6

D:R = -2

From Henikoff 1996

Scoring Matrices

Physical/Chemical similarities

comparing two sequences according to the properties of their residues may highlight regions of structural similarity

Identity matrices

by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features

Scoring Matrices (ctd)

As the direct source of residue-by-residue comparison scores, the scoring matrix you choose will have a major impact on the alignment calculated

The most commonly used will be one of the mutation matrices

PAM or BLOSUM

Von Bing will explain the derivation of these and other mutation matrices next Tuesday.

The matrix that performs best will be the matrix that best reflects the evolutionary separation of the sequences being aligned.

Statistical motivation for alignment scores

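Alignment (data):

AGCTGATCA...
AACCGGTTA...

Hypotheses: H = homologous (indep. sites, Jukes-Cantor); R = random (indep. sites, equal freq.)

$\Pr(\text{data} \mid H) = \Pr(\text{column 1} \mid H) \times \Pr(\text{column 2} \mid H) \times \dots = (1-p)^a \, p^d$

where d = # disagreements, a = # agreements, and $p = \tfrac{3}{4}\,(1 - e^{-8\alpha t})$.

$\Pr(\text{data} \mid R) = \Pr(\text{column 1} \mid R) \times \Pr(\text{column 2} \mid R) \times \dots = (\tfrac{1}{4})^a \, (\tfrac{3}{4})^d$

$\log \left\{ \frac{\Pr(\text{data} \mid H)}{\Pr(\text{data} \mid R)} \right\} = a \log \frac{1-p}{1/4} + d \log \frac{p}{3/4}$. Since $p < \tfrac{3}{4}$, $\log \frac{p}{3/4} < 0$ and $\log \frac{1-p}{1/4} > 0$.

score $= a\,\sigma + d\,(-\mu)$, where $\sigma = \log \frac{1-p}{1/4} > 0$ is the match score and $-\mu = \log \frac{p}{3/4} < 0$ is the mismatch penalty.

Note that if $t \approx 0$, then $p \approx 6\alpha t$ and $1-p \approx 1$, and so $\sigma \approx \log 4$, while $-\mu \approx \log 8\alpha t$ is large and negative: a big difference in the two scores.

Conversely, if t is large, write $\varepsilon = e^{-8\alpha t}$. Then $p = \tfrac{3}{4}(1-\varepsilon)$, $\frac{p}{3/4} = 1-\varepsilon$, and $\log(1-\varepsilon) \approx -\varepsilon$, while $1-p = \tfrac{1}{4}(1+3\varepsilon)$, $\frac{1-p}{1/4} = 1+3\varepsilon$, and so $\log(1+3\varepsilon) \approx 3\varepsilon$. Thus the scores are about 3:1.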
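The two limiting cases can be checked numerically. The sketch below assumes the Jukes-Cantor form p = (3/4)(1 - exp(-8αt)) for the probability that the two sequences disagree at a site, and takes the product αt as a single input; the function name `jukes_cantor_scores` is made up for this example.

```python
import math


def jukes_cantor_scores(alpha_t):
    """Return (match score, mismatch score) as log-odds of H against R.

    Assumption: p = 3/4 * (1 - exp(-8 * alpha * t)) is the per-site
    probability of a disagreement under H; under R it is 3/4.
    """
    p = 0.75 * (1.0 - math.exp(-8.0 * alpha_t))
    match = math.log((1.0 - p) / 0.25)   # positive
    mismatch = math.log(p / 0.75)        # negative
    return match, mismatch


for alpha_t in (0.0001, 0.01, 0.1, 1.0):
    match, mismatch = jukes_cantor_scores(alpha_t)
    print(f"alpha*t={alpha_t:<7} match={match:.4f} mismatch={mismatch:.4f}")
```

For small αt the match score stays near log 4 while the mismatch score is strongly negative; for large αt both shrink toward zero with the match score about three times the size of the mismatch score.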

We can do the same with any other Markov substitution matrix for molecular evolution. E.g. with a PAM or BLOSUM matrix of probabilities,

data = a gap-free alignment of two a.a. sequence fragments:

$a_1 \dots a_m$
$b_1 \dots b_m$

$\Pr(\text{data} \mid H) = \prod_{i=1}^{m} \pi_{a_i} \, p_{a_i b_i}(2t)$, $\qquad \Pr(\text{data} \mid R) = \prod_{i=1}^{m} \pi_{a_i} \pi_{b_i}$

$\log \left\{ \frac{\Pr(\text{data} \mid H)}{\Pr(\text{data} \mid R)} \right\} = \sum_{i=1}^{m} \log \left\{ \frac{p_{a_i b_i}(2t)}{\pi_{b_i}} \right\}$

The elements of a log-odds score matrix are typically > 0 on the diagonal and < 0 off the diagonal, but not always.

Also the relative sizes of match and mismatch penalties increase as #PAMs (t) decreases. Thus PAM(120) is more stringent than PAM(250), while PAM(360) is less stringent than it.

PAM(0) = the identity matrix is the toughest.

There are plenty of score matrices based on other principles.

Below diagonal: BLOSUM62 substitution matrix

Above diagonal: Difference matrix obtained by subtracting the PAM 160 matrix entrywise.

From Henikoff & Henikoff 1992


0 -1 1 0 2 1 1 2 1 2 0 0 2 4 1 5 1 2 -2 5 C

2 0 -2 0 -1 0 0 0 1 0 0 0 1 0 1 -1 1 1 -1 S

C 9 2 -1 -1 -1 0 0 0 0 0 0 -1 0 -1 1 0 1 1 3 T

S -1 4 2 -2 -1 -1 0 0 -1 -1 -1 1 1 0 -1 0 0 2 1 P

T -1 1 5 2 -1 -2 -2 -1 0 0 1 1 0 0 1 0 1 1 2 A

P -3 -1 -1 7 2 0 -1 -2 0 1 1 0 0 -1 0 -1 1 2 4 G

A 0 1 0 -1 4 3 -1 -1 0 0 1 -1 0 -1 0 -1 0 0 0 N

G -3 0 -2 -2 0 6 2 -1 -1 -1 0 -1 0 0 0 0 2 1 3 D

N -3 1 0 -2 -2 0 6 1 0 0 2 2 1 -1 0 0 2 2 4 E

D -3 0 -1 -1 -2 -1 1 6 0 -2 0 1 1 -1 0 0 1 3 3 Q

E -4 0 -1 -1 -1 -2 0 2 5 2 -1 0 1 0 -1 0 1 2 2 H

Q -3 0 -1 -1 -1 -2 0 0 2 5 -1 -1 0 -1 1 0 1 3 -4 R

H -3 -1 -2 -2 -2 -2 1 -1 0 0 8 1 -2 -1 1 1 2 3 1 K

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 -2 -1 -1 0 1 2 4 M

K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5 -1 1 0 0 1 3 I

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 -1 0 -1 1 2 L

I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 0 1 2 4 V

L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 -1 -2 1 F

V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4 -1 2 Y

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 -1 W

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7

W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11


Above diagonal: SG scoring system (Feng et al., 1985)

Below diagonal: Log-odds matrix for 250 PAMs (Dayhoff et al., 1978)


6 4 2 2 2 3 2 1 0 1 2 2 0 2 2 2 2 3 3 3 C

6 5 4 5 5 5 3 3 3 3 3 3 1 2 2 2 3 3 2 S

C 12 6 4 5 2 4 2 3 3 2 3 4 3 3 2 3 1 2 1 T

S 0 2 6 5 3 2 2 3 3 3 3 2 2 2 3 3 2 2 2 P

T -2 1 3 6 5 3 4 4 3 2 2 3 2 2 2 5 2 2 2 A

P -3 1 0 6 6 3 4 4 2 1 3 2 1 2 2 4 1 2 3 G

A -2 1 1 1 2 6 5 3 3 4 2 4 1 2 1 2 1 3 0 N

G -3 1 0 -1 1 5 6 5 4 3 2 3 0 1 1 3 1 2 0 D

N -4 1 0 -1 0 0 2 6 4 2 2 4 1 1 1 4 0 1 1 E

D -5 0 0 -1 0 1 2 4 6 4 3 4 2 1 2 2 1 2 1 Q

E -5 0 0 -1 0 0 1 3 4 6 4 3 1 1 3 1 2 3 1 H

Q -5 -1 -1 0 0 -1 1 2 2 4 6 5 2 2 2 2 1 1 2 R

H -3 -1 0 0 -1 -2 2 1 1 3 6 6 2 2 2 3 0 1 1 K

R -4 0 0 0 -2 -3 0 -1 -1 1 2 6 6 4 5 4 2 2 3 M

K -5 0 0 -1 -1 -2 1 0 0 1 0 3 5 6 5 5 4 3 2 I

M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6 6 5 4 3 4 L

I -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 6 4 3 3 V

L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 6 5 3 F

V -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 6 3 Y

F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9 6 W

Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10

W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17


Gap penalties

Gap penalties are usually composed of two parts:

Gap opening penalty

Opening a gap reduces the alignment score, so a gap is only worthwhile if it leads to a sufficiently better alignment downstream than would be present if no gap were created

The size of the penalty is usually of the order of one to three times the size of values in the scoring matrix

Gap penalties (ctd)

Gap extension penalty

If a gap has been created then extending it should not be as hard to do

On the other hand we want to limit the size of the gap to practical lengths

A smaller gap extension penalty may allow an alignment to resolve situations where complete loops may be missing between one structure and another
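A two-part gap penalty of this kind is often written as an affine function of gap length. The sketch below is one common convention, assumed here for illustration: the opening penalty is charged for the first gap position and the extension penalty for each further position. The default values 10.0 and 0.5 and the function names are hypothetical, not the settings used in the alignments on these slides.

```python
import re


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of `length` positions (assumed convention:
    gap_open for the first position, gap_extend for each additional one)."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def gap_lengths(aligned_row):
    """Lengths of the runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def total_gap_penalty(aligned_row, gap_open=10.0, gap_extend=0.5):
    """Sum the affine penalty over every gap in the row."""
    return sum(affine_gap_penalty(n, gap_open, gap_extend)
               for n in gap_lengths(aligned_row))


print(affine_gap_penalty(4), 2 * affine_gap_penalty(2))
```

Under this convention one gap of length 4 costs far less than two gaps of length 2, which is what lets a whole missing loop be absorbed by a single long gap.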

Low gap penalty

eclustalw May 24, 1999 18:44

lgb1_pea.pep ck: 2970 from: 1 to: 147 Length: 147
hbhu.pep ck: 3588 from: 1 to: 147 Length: 147

Pairwise similarity parameter:
K-Tuple length: 1
Gap Penalty: 3
Number of diagonals: 5
Diagonal window size: 5
Scoring Method: Percentage

Multiple alignment parameter:
Gap Penalty (fixed): 1.00
Gap Penalty (varying): 0.05
Gap separation penalty range: 8
Percent. identity for delay: 40%
List of hydrophilic residue: GPSNDQEKR
Protein Weight Matrix: blosum

                     10        20        30        40        50        60
                      .         .         .         .         .         .
LGB1_PEA.pep --GFTDKQE-ALVNSSSEFKQNLPGYSILFYTIVLEKAPAAKGLF-SF--LKDTAGVEDS
HBHU.pep     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVY--PWTQRFFESFGDLSTPDAVMGN

LGB1_PEA.pep PKLQAHAEQVFGLVRDSAAQLR-TKGEVVLGNATLGAIHVQKGVTNP-HFVVVKEALLQT
HBHU.pep     PKVKAHGKKVLGAFSDGLAHLDNLKGTF----ATLSELHCDKLHVDPENFRLLGNVLVCV

LGB1_PEA.pep IKKASGNNWSEELNTAWEVAYDGLATAIKKAMKTA
HBHU.pep     LAHHFGKEFTPPVQAAYQKVVAGVANAL--AHKYH

Middling gap penalty

eclustalw May 24, 1999 18:50

                     10        20        30        40        50        60
                      .         .         .         .         .         .
LGB1_PEA.pep ----GFTDKQEALVNSSSEFKQNLPGYSILFYTIVLEKAPAAKGLFSFLKDTAGVEDSPK
HBHU.pep     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK

LGB1_PEA.pep LQAHAEQVFGLVRDSAAQLRTKGEVVLGNATLGAIHVQKGVTNP-HFVVVKEALLQTIKK
HBHU.pep     VKAHGKKVLGAFSDGLAHLDN---LKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAH

LGB1_PEA.pep ASGNNWSEELNTAWEVAYDGLATAIKKAMKTA
HBHU.pep     HFGKEFTPPVQAAYQKVVAGVANALAHKYH--

Very high gap penalty

eclustalw May 24, 1999 18:52

                     10        20        30        40        50        60
                      .         .         .         .         .         .
LGB1_PEA.pep ----GFTDKQEALVNSSSEFKQNLPGYSILFYTIVLEKAPAAKGLFSFLKDTAGVEDSPK
HBHU.pep     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK

LGB1_PEA.pep LQAHAEQVFGLVRDSAAQLRTKGEVVLGNATLGAIHVQKGVTNPHFVVVKEALLQTIKKA
HBHU.pep     VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPEN--FRLLGNVLVCVLAHH

LGB1_PEA.pep SGNNWSEELNTAWEVAYDGLATAIKKAMKTA
HBHU.pep     FGKEFTPPVQAAYQKVVAGVANALAHKYH--

Dynamic Programming

This is a mathematical implementation that can be seen as an extension of the dotplot method

Rather than dots, the comparison matrix positions are assigned values that reflect the scores in the scoring matrix

For obtaining optimal alignments

Dynamic Programming

The optimum alignment is obtained by tracing the highest scoring path from the top left-hand corner to the bottom right-hand corner of the matrix

When the alignment steps away from the diagonal this implies an insertion or deletion event, the impact of which can be assessed by the application of a gap penalty

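[Comparison matrix: b along the top, a down the side; an entry is 0 where the two letters match and 1 where they differ.]

        b:  A  C  A  C  A  C  T  A
a:   A      0  1  0  1  0  1  1  0
     G      1  1  1  1  1  1  1  1
     C      1  0  1  0  1  0  1  1
     A      0  1  0  1  0  1  1  0
     C      1  0  1  0  1  0  1  1
     A      0  1  0  1  0  1  1  0
     C      1  0  1  0  1  0  1  1
     A      0  1  0  1  0  1  1  0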

Dynamic programming: the formula

Suppose that our two sequences are $a = (a_1, \dots, a_m)$ and $b = (b_1, \dots, b_n)$, and that we denote by $d_{ij}$ the edit distance between the initial segments $(a_1, \dots, a_i)$ and $(b_1, \dots, b_j)$ of a and b.

Extend this to i=j=0 by writing $d_{00} = 0$.

Supposing that a deletion or an insertion incurs a penalty of +1, the following formula summarizes our verbal argument:

$d_{ij} = \min\left( d_{i-1,j-1} + s(a_i, b_j),\; d_{i,j-1} + 1,\; d_{i-1,j} + 1 \right)$.

(More is needed to give a complete algorithm: what is it?)

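[Table of $d_{ij}$: b along the top, a down the side; the first row and column correspond to i = 0 and j = 0.]

              A  C  A  C  A  C  T  A
           0  1  2  3  4  5  6  7  8
     A     1  0  1  2  3  4  5  6  7
     G     2  1  1  2  3  4  5  6  7
     C     3  2  1  2  2  3  4  5  6
     A     4  3  2  1  2  2  3  4  5
     C     5  4  3  2  1  2  2  3  4
     A     6  5  4  3  2  1  2  3  3
     C     7  6  5  4  3  2  1  2  3
     A     8  7  6  5  4  3  2  2  2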
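One way to turn the recurrence into a complete algorithm is sketched below in Python. The extra ingredients assumed here are the boundary values d(i,0) = i and d(0,j) = j (the cost of aligning a prefix against nothing) and a row-by-row fill order; the function name `edit_distance_table` is made up for this example.

```python
def edit_distance_table(a, b, gap=1):
    """Fill the table d[i][j] = edit distance between a[:i] and b[:j].

    s(x, y) is 0 for a match and 1 for a mismatch; `gap` is the penalty
    for an insertion or a deletion.
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary values: a prefix aligned against the empty sequence.
    for i in range(1, m + 1):
        d[i][0] = i * gap
    for j in range(1, n + 1):
        d[0][j] = j * gap
    # Every other entry depends only on its three already-filled neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + s,   # match or replace
                          d[i][j - 1] + gap,     # insert
                          d[i - 1][j] + gap)     # delete
    return d


table = edit_distance_table("AGCACACA", "ACACACTA")
for row in table:
    print(" ".join(f"{v:2d}" for v in row))
print("edit distance:", table[-1][-1])
```

The bottom-right entry is the edit distance d(a,b); recovering an optimal alignment additionally requires a traceback through the table.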

Chance or common ancestry?

Idea: calculate optimal alignment scores for pairs of sequences where one is a randomized (shuffled) version of the original. This will give a distribution of random scores, representing chance similarity rather than homology.

The score from our original pair of sequences can be referred to this distribution and assigned a Z-score (subtract mean of randoms and divide by SD of randoms), or (better) a p-value.

Criticism: Such random a.a. sequences might have plausible a.a. compositions but are quite unlike real protein sequences.

Partial reply: a) restrict the randomization to blocks; or, b) create a distribution of chance similarity scores using real a.a. sequences known or assumed not to be homologous to our query sequence. [Other approaches use theory, but this is still subject to the criticism above.]
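The shuffling idea can be sketched as follows. Everything specific here is an assumption made for illustration: the score is a simple global alignment score with match +1, mismatch -1 and gap -1 rather than a real scoring matrix, the second sequence is shuffled letter by letter, and the p-value is the usual empirical estimate with one added to numerator and denominator. The function names are hypothetical.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score by dynamic programming (toy scoring)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Compare the observed score with scores against shuffled copies of b.

    Returns (observed score, Z-score, empirical p-value).
    """
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = global_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)           # same composition, random order
        null_scores.append(global_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    z = (observed - mean) / sd if sd > 0 else float("inf")
    p = (1 + sum(s >= observed for s in null_scores)) / (n_shuffles + 1)
    return observed, z, p


print(shuffle_test("AGCACACA", "ACACACTA"))
```

Shuffling preserves composition but destroys all order, which is exactly the weakness raised in the criticism above; restricting the shuffle to blocks would be one way to soften it.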

Dynamic Programming

Based on notes by George Rudy, formerly WEHI.

“Life must be lived forwards and understood backwards.”

Søren Kierkegaard

What is DP?

Operations research: “A mathematical formalism applicable to problems involving optimization of decisions over time.”

(after R. Bellman and S. Dreyfus)

Bioinformatics: “An algorithm for finding optimal sequence alignments given an additive alignment score.”

(after R. Durbin et al.)

Computer programming: “An approach to algorithm design whereby the target problem is decomposed into smaller problems that are then solved independently.”

(after R. Sedgewick)

Where did DP come from?

- Richard Bellman

- The RAND Corporation

- “Dynamic” and “Programming”

Where can DP be applied?

- Both discrete and continuous problems concerning deterministic, stochastic, or adaptive processes

- Multiple fields: research, industry, finance,…

- Examples: allocation processes

smoothing and scheduling processes

optimal search and stopping techniques

optimal trajectories

multistage production processes

feedback control processes

Markovian decision processes

DP in biomedical literature (1)

[Figure: chart with a vertical axis running from 0 to 25 and a horizontal axis labelled Years.]

DP in biomedical literature (2)

- A symmetric-iterated multiple alignment of protein sequences.

[Brocchieri, L. and Karlin S., J. Mol. Biol. 276(1):249-64, 1998.]

- Sequence assembly validation by multiple restriction digest fragment coverage analysis.

[Rouchka, E.C. and States, D.J., ISMB. 6:140-7, 1998.]

- Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment.

[Gracy, J. and Argos, P., Bioinformatics 14(2):164-73, 1998.]

- A segment-based dynamic programming algorithm for predicting gene structure.

[Wu, T.D., J. Comput. Biol. 3(3):375-94, 1996.]

- Automatic detection of cardiac contours on MR images using fuzzy logic and dynamic programming.

[Lalande A. et al., Proc. AMIA Annu. Fall Symp. :474-8, 1997.]

- Process models for production of beta-lactam antibiotics.

[Bellgardt, K.H., Adv. Biochem. Eng. Biotechnol. 60:153-94, 1998.]

- Dynamic programming approach for newborn’s incubator humidity control.

[Bouattoura, D. et al., IEEE Trans. Biomed. Eng. 45(1):48-55, 1998.]

- Minimum energy trajectories of the swing ankle when stepping over obstacles of different heights.

[Chou L.S. et al., J. Biomech. 30(2):115-20, 1997.]

- A theoretical study of the socioecology of ungulates. II. A dynamic programming study of the stochastic formulation.

[Paveri-Fontana, S.L. and Focardi, S. Theor. Popul. Biol. 46(3):279-99, 1994.]

What problems are suitable for DP?

- Essential components (common to all OR problems):

a decision-maker

access to results of decisions

- Additionally:

decisions are sequential

later decisions are affected by earlier ones

effect of a decision can be calculated independently of other decisions

The Stagecoach Problem (1)

[Figure: the stagecoach network, after S. E. Dreyfus. Sixteen vertices labelled A to P are arranged in successive stages and joined by 24 edges, each marked with its cost (values between 0 and 8).]

Some terminology

- Vertex

- Edge

- Path

- Monotonic-to-the-right

- (Admissible) path

- Stage

- State


[Figures: the stagecoach network repeated over five slides. Starting from the value 0 at the terminal vertex, the minimum remaining cost at each vertex is filled in stage by stage, working backwards, until all 16 vertices are labelled.]

Some more terminology

- Optimal value function

- Policy

- Optimal policy function


[Figure: the stagecoach network with the optimal value shown at every vertex.]

Efficiency of the DP approach

- At each of 9 vertices where a real choice existed: 2 additions and 1 binary comparison

- At the other 6 vertices: 1 addition

Total: 24 additions and 9 comparisons

- Compare this with direct evaluation of the original problem by enumeration of all 20 admissible paths:

5 additions/path = 100 additions, plus 19 comparisons to find the smallest of the 20 path values

Efficiency (2), and the Curse of Dimensionality

In general, for the n-stage problem treated here,

DP involves $\frac{n^2}{2} + n$ additions.

Direct enumeration generates $\binom{n}{n/2} = \frac{n!}{(n/2)!\,(n/2)!}$ paths, or $(n-1)\,\frac{n!}{(n/2)!\,(n/2)!}$ additions.

Thus, for n=20, DP requires 220 additions while direct enumeration would demand 3,510,364 additions.
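The counts quoted for n = 6 and n = 20 can be reproduced with a few lines of Python. The function names are made up for this check, and n is assumed to be even, as in the network treated here.

```python
from math import comb


def dp_additions(n):
    """Additions used by DP on the n-stage network: n^2/2 + n."""
    return n * n // 2 + n


def enumeration_paths(n):
    """Number of admissible paths: n choose n/2."""
    return comb(n, n // 2)


def enumeration_additions(n):
    """Each path has n edges, so summing it takes n - 1 additions."""
    return (n - 1) * enumeration_paths(n)


for n in (6, 20):
    print(n, dp_additions(n), enumeration_paths(n), enumeration_additions(n))
```

The DP count grows quadratically in n while the enumeration count grows roughly like 2^n, which is the point of the comparison.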


[Figure: the stagecoach network drawn on coordinate axes, with ticks 1 to 6 on the x axis and -3 to 3 on the y axis.]

The Principle of Optimality, or Bellman’s Principle

“An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.” (Bellman)

or, “An optimal sequence of decisions in a multistage decision process problem has the property that whatever the initial stage, state, and decision are, the remaining decisions must constitute an optimal sequence of decisions for the remaining problem, with the stage and state resulting from the first decision considered as initial conditions.” (Dreyfus)

or, “An optimal policy must have the property that no matter what path is taken to enter a particular state, the remaining stages (decisions) taken must constitute an optimal policy for departure from that state.”

or, “An optimal policy is comprised of optimal subpolicies.”

or, “An optimal policy from any state is independent of the path taken to that state, and is made up entirely of optimal subpolicies.”

or, ...

The optimal value function

$S(x,y)$ = the value of the minimum-value admissible path connecting the vertex (x,y) and the terminal vertex (6,0)

$e_u(x,y)$ = the value of the edge connecting the vertices (x,y) and (x+1, y+1)

$e_d(x,y)$ = the value of the edge connecting the vertices (x,y) and (x+1, y-1)

$S(x,y) = \min \{\, e_u(x,y) + S(x+1, y+1),\; e_d(x,y) + S(x+1, y-1) \,\}$

$S(6,0) = 0$.
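The recurrence for S(x,y) can be tried out on a network of the same shape. In the sketch below the edge values are HYPOTHETICAL, generated by an arbitrary formula, because the edge values of the network on the slides are not reproduced here; only the diamond layout with start vertex (0,0) and terminal vertex (6,0) is taken from the text. A brute-force enumeration of all admissible paths is included as a check.

```python
from functools import lru_cache
from itertools import product

N = 6  # number of stages; the terminal vertex is (N, 0)


def is_vertex(x, y):
    """Vertices of the diamond-shaped network between (0, 0) and (N, 0)."""
    return 0 <= x <= N and abs(y) <= min(x, N - x) and (x + y) % 2 == 0


def edge_cost(x, y, dy):
    """HYPOTHETICAL edge value from (x, y) to (x+1, y+dy); dy=+1 is e_u, dy=-1 is e_d."""
    return (3 * x + 2 * y + 5 * dy) % 7 + 1


@lru_cache(maxsize=None)
def S(x, y):
    """Optimal value function: cheapest admissible path from (x, y) to (N, 0)."""
    if (x, y) == (N, 0):
        return 0
    options = [edge_cost(x, y, dy) + S(x + 1, y + dy)
               for dy in (+1, -1) if is_vertex(x + 1, y + dy)]
    return min(options)


def brute_force():
    """Enumerate every admissible path; return (cheapest value, number of paths)."""
    totals = []
    for steps in product((+1, -1), repeat=N):
        x, y, total = 0, 0, 0
        for dy in steps:
            if not is_vertex(x + 1, y + dy):
                break                      # this step leaves the network
            total += edge_cost(x, y, dy)
            x, y = x + 1, y + dy
        else:
            totals.append(total)           # all N steps stayed on the network
    return min(totals), len(totals)


print("DP value from the start vertex:", S(0, 0))
print("brute force (value, paths):", brute_force())
```

Memoizing S means each vertex is evaluated once, which mirrors tabulating the optimal value function stage by stage.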

A more formal restatement of common features of DP problems

A physical system characterized at any stage by a small set of parameters, the state variables;

At each stage of the process there is a choice of a number of decisions;

The effect of a decision is a transformation of the state variables;

The past history of the system is of no importance in determining future actions;

The purpose of the process is to maximize some function of the state variables.

The practice of DP

Imbed the specific given problem in a more general family of problems;

Define the optimal value function which associates a value with each of the various possible initial conditions of problems in that family;

Invoke the principle of optimality in order to deduce a recurrence relation characterizing that function;

Seek the solution of the recurrence relation in order to obtain the optimal policy function which furnishes the solution to the specific given problem and all other problems in the more general family as well.

More practically speaking,

Determine the decision-maker and the decisions to be made;

Determine the stages;

Determine the possible states;

Formulate the optimal value function in the form of a recurrence relation;

Calculate and tabulate the optimal value function for each stage and state;

Find the optimal policy (ies) for the problem.

New problem, new terminology

Edit operations: M(atch), R(eplacement), I(nsert), D(elete).

Edit transcript: A string over the alphabet M, R, I, D that describes a transformation of one string into another. Example:

R D I M D M
M A - T H S
A - R T - S

Edit (Levens(h)tein) distance: The minimum number of edit operations necessary to transform one string into another. (Note: matches are not counted.) Example: the transcript above costs

R   D   I   M   D   M
1 + 1 + 1 + 0 + 1 + 0 = 4

(This transcript is not minimal: the tabulation of D(i, j) gives an edit distance of 3 for MATHS and ARTS.)

Once again,

Imbed the problem in the more general family;

Define the optimal value function;

Deduce the recurrence relation;

Solve the recurrence relation to obtain the optimal policy function.

The recurrence

Stage: position in the edit transcript;

State: I, D, M, or R;

Optimal value function: D(i, j)

where D(i, j) = edit distance of Seq1[1...i] and Seq2[1...j]

Recurrence relation:

D(i, j) = min { 1 + D(i-1, j), 1 + D(i, j-1), t(i, j) + D(i-1, j-1) },

where t(i, j) = 0 if Seq1(i) = Seq2(j), and t(i, j) = 1 otherwise, with boundary values D(i, 0) = i and D(0, j) = j.

The tabulation, D(i, j)

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0
M         1
A         2
T         3
H         4
S         5

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0
M         1
A         2
T         3
H         4
S         5

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0   1
M         1
A         2
T         3
H         4
S         5

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0   1   2
M         1
A         2
T         3
H         4
S         5

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0   1   2   3   4
M         1            1
A         2            2
T         3            3
H         4            4
S         5            5

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0   1   2   3   4
M         1            1   1
A         2            2
T         3            3
H         4            4
S         5            5

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0   1   2   3   4
M         1            1   1   2
A         2            2
T         3            3
H         4            4
S         5            5

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0   1   2   3   4
M         1            1   1   2   3   4
A         2            2   1   2   3   4
T         3            3
H         4            4
S         5            5

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0   1   2   3   4
M         1            1   1   2   3   4
A         2            2   1   2   3   4
T         3            3   2   2   2   3
H         4            4
S         5            5

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0   1   2   3   4
M         1            1   1   2   3   4
A         2            2   1   2   3   4
T         3            3   2   2   2   3
H         4            4   3   3   3   3
S         5            5   4   4   4   3

The traceback

Seq2(j):                   A   R   T   S
Seq1(i)   i \ j        0   1   2   3   4
          0            0   1   2   3   4
M         1            1   1   2   3   4
A         2            2   1   2   3   4
T         3            3   2   2   2   3
H         4            4   3   3   3   3
S         5            5   4   4   4   3

The solutions - #1

1 + 0 + 1 + 1 + 0 = 3

D M R R M
M A T H S
- A R T S


The solutions - #2

1 + 0 + 1 + 0 + 1 + 0 = 3

D M I M D M
M A - T H S
- A R T - S


The solutions - #3

1 + 1 + 0 + 1 + 0 = 3

R R M D M
M A T H S
A R T - S
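The traceback that produces these solutions can be sketched in Python. The code fills D(i, j) with unit costs and then walks back from the bottom-right cell, following every predecessor that attains the minimum, so that all optimal edit transcripts are collected. The function names are made up for this illustration.

```python
def edit_table(s1, s2):
    """D[i][j] = edit distance between s1[:i] and s2[:j] (unit costs)."""
    D = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        D[i][0] = i
    for j in range(len(s2) + 1):
        D[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(1 + D[i - 1][j], 1 + D[i][j - 1], t + D[i - 1][j - 1])
    return D


def optimal_alignments(s1, s2):
    """Return every optimal (transcript, top row, bottom row) triple."""
    D = edit_table(s1, s2)
    results = []

    def walk(i, j, ops, top, bottom):
        # ops/top/bottom are built backwards and reversed at the end.
        if i == 0 and j == 0:
            results.append(("".join(reversed(ops)),
                            "".join(reversed(top)),
                            "".join(reversed(bottom))))
            return
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:      # match or replacement
                walk(i - 1, j - 1, ops + ["M" if t == 0 else "R"],
                     top + [s1[i - 1]], bottom + [s2[j - 1]])
        if i > 0 and D[i][j] == D[i - 1][j] + 1:    # delete from Seq1
            walk(i - 1, j, ops + ["D"], top + [s1[i - 1]], bottom + ["-"])
        if j > 0 and D[i][j] == D[i][j - 1] + 1:    # insert from Seq2
            walk(i, j - 1, ops + ["I"], top + ["-"], bottom + [s2[j - 1]])

    walk(len(s1), len(s2), [], [], [])
    return results


solutions = optimal_alignments("MATHS", "ARTS")
for transcript, top, bottom in solutions:
    print(transcript, top, bottom)
```

The recursion branches wherever more than one predecessor attains the minimum, which is why a single table can yield several equally good alignments.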

DP, in general (well, for a discrete, deterministic, additive process, anyway)

F(t, s) = Opt { r(t, s, x) + a F(t´, s´) : x in X(t, s) and s´ = T(t, s, x) }

Need not be additive. For a stochastic process, r and F are expected values; the state transform is random with probability distribution P[T(t, s, x) = s´ | s, x], and F(t´, s´) is replaced by

∑_{s´} F(t´, s´) P[T(t, s, x) = s´ | s, x]

“Life must be lived forwards and understood backwards.”

Søren Kierkegaard
