Smith and Water Man 1981

J . AMoZ. BWl. (1981), 147, 19,5197

Identification of Common Molecular Subsequences

The identification of maximally homologous subsequences among sets of long sequences is an important problem in tnolecular sequence analysis. The problem is straightforward only if one restricts consideration t o cont.igucws subsequences (segments) containing no internal deletions or insertions. The more general problem has its solution in an extension of sequence metrics (Sellers 19i.i: IVaterman el al.. 1976) developed to measure the mininlunl number of "et-cntsr' required to convert one sequence into another.

These developments in t.he modern sequence analysis began with the heuristic homology algorithm of Seedleman & Wunsch (1970) which first introduced an iterative matrix method of calculation. Sumerous other heuristic algorithms have been suggested including those of Fitch (1966) and Dayhoff (1969). More mathernat- icallg rigorous algorithms were suggested by Sankoff (1972), Reichert. ef d . (1973) and Beyer el d. (1979), but these were generally not biologically satisfjing or interpretable. Successcame with Sellers (19i4) development of a true metric measure of the distance betweef: sequences. This metric was later generalized by Waterman et al. (1976) to include deletions/insertions of arbitrary length. This metric represents the minimum number of "mutat.iona1 events'' required t,o convert one sequence into another. It is of interest to n0t.e that Smith et al. (1980) hare recently shown that under some conditions the generalized Sellers metric is equivalent to the original homology algorithm of Seedleman S. Wunsch (19iO).

In this letter we extend the above ideas to find a pair of segments. one from each of two long sequences. such that there is no other pair of segments with greater similarity (homology). The similarity measure used here allows for arbitrary length deletions and insertions.

=Ilgorithm The two molecular sequences will be A=a1a,2 . . . a, and @=b,b, . . . bm. A

similarity s(a,b) is given between sequence elements a and b. Deletions of length 12 aregiven weight W k . To find pairs of segments with high degrees of simiIarity. we set. up a matrix H . First set

HkO = Ho, = 0 for 0 I k I I?. and 0 I 1 5 m. Preliminary values of H have the interpretation that H i j is the maximum similarit,y of two segments a d i n g in ai and bi. respectively. These values are obtained from the relationship

Hij=max{Ni - +s(a.i.bj). max {Hi-k,j- Wk). mas ti',). 0) . ( I ) k 2 1 12 I

1 a; -1'. v. 8.\1 ITH .a S 1) . \ I . S . \V.A'l'ER >I.\

The formula for H i j follows by considering the possibilities for ending the

( I ) I f ai and bj are sssociated. the similarity is I segments at any ai and bj.

Hi- l . j - l +.s(ai.bj).

(2) If ai is at the end of a deletion of length k. the similarity is

H i - k - j - 1 ' v k .

(3) If bj is at the end of a deletion of length 1. the similarity is

H i - k , j - l I-1.

(4) Finally, a zero is included to prevent calculated negative similarity, indicating no similarity up to ai and bj.t

The pair of segments with maximum similarity is found by first locating the maximum element of H. The other matrix elements leading to this maximum va.lue are than sequentially determined with a traceback procedure ending with an element of H equal t.o zero. This procedure identifies the segments as well as produces the corresponding alignment. The pair of segments with the next best similarity is found by applying the traceback procedure to the second largest element of H not associated with the first. traceback.

A simple example is given in Figure 1. In t.his example the parameters s(aibj) and , IFk required were chosen on an (I priori sta.tistica1 basis. A match, a,= bj, produced an staibj) value of unity while a mismatch produced a minus one-third. These values i

have an average for long, random sequences over an equally probable four letter set , of zero. The deletion weight must be chosen to be at least equal to the difference ' ~

between a match and a mismatch. The value used here was = I-O+ 1/3*k.

FI~:. 1. Hijmstrix generated from the application ofeqn ( I ) to the sequences d-A-C'-G-C-C-A-W-U-C-1\- ! ('-C;-G and ( ~ - - ~ - ~ i - ~ - C - ~ - ~ - ( ; - ~ - ~ - - ~ - . \ - ( ; . The underlined elements indicate the trackback path fmm the maximal element 3-30. I t Zero need not be included unless there are negative values of a(a.b). 1

contains both a mismatch and an internal deletion. It is the identification of the latter which has not been previously possible in any rigorous manner.

This algorithm not only puts the search for pairs of maximally similar segments on a mathematically rigorous basis hut it can be efficiently and simply programmed on a computer.

Sorthrrn Michigan C'niversity T. F. SNI'I'H

Los Alamos Scientific Lalmratory P.0- Box 1063. Los Xlamos X. Mex. 87.545. L?.S.A.

REFEREWES Beyer. 1%'. A.. Smith. T. F.. Stein. 11. L. & rlam. S. 31. (1979). J f u f h . B k w f . 19. 9-25. Dayhoff. M. 0. (1969). ;IfasofProteitt S q w t t w a d S t r w f a r r . Sational Biomedical Research

Foundation. Silver Springs. 1IarFland. Fitch. W. M. (1966). J. -lid. B id . 16. 9-13. Xeedleman. S. B. Q Wunsch. C. D. (1970). J . No/. Biol. 48. 443-453. Reichert, T. A.. Cohen. D. S. & LVong. X. K. ('. (1973). .I. Thuotrt. Bioi. 42. 24%-261. Sankoff, D. (1972). Proe. .Yd. ;tcad. Sc i . . [*..i'.-4. 61. 4-6. Sellers. P. H. (1974). J. dppl . Mafh. (.%ow). 26. 787-793. Smith. T. F.. Waterman. 31. S. Q Fitch. 1V. JI. (1981). J. N o / . E d . In the press. Waterman. 31. S.. Smith. T. F. Q Beyer. \V. X. (19i6). .-ldzurrc. Jllafh. 20. 36i-385.

S d e added i a pf: A weighting similar t.0 t.hat given ahove was independently developed by Walter Goad of Los Alsrnos Scientific Laboratory.

Smith and Water Man 1981

Documents

Transcript of Smith and Water Man 1981