Basic Local Alignment Search Tool


  • 8/3/2019 Basic Local Alignment Search Tool


    J. Mol. Biol. (1990) 215, 403-410

    Basic Local Alignment Search Tool

    Stephen F. Altschul1, Warren Gish1, Webb Miller2, Eugene W. Myers3 and David J. Lipman1

    1 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, U.S.A.

    2 Department of Computer Science, The Pennsylvania State University, University Park, PA 16802, U.S.A.

    3 Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.

    (Received 26 February 1990; accepted 15 May 1990)

    A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification [...]

    In this paper we describe a new method, BLAST (Basic Local Alignment Search Tool), which employs a measure based on well-defined mutation scores. It directly approximates the results that would be obtained by a dynamic programming algorithm for optimizing this measure. The method will detect weak but biologically significant sequence similarities, and is more than an order of magnitude faster than existing heuristic algorithms.

    2. Methods

    (a) The maximal segment pair measure

    Sequence similarity measures generally can be classified as either global or local. Global similarity algorithms optimize the overall alignment of two sequences, which may include large stretches of low similarity (Needleman & Wunsch, 1970). Local similarity algorithms seek only relatively conserved subsequences, and a single comparison may yield several distinct subsequence alignments; unconserved regions do not contribute to the measure of similarity (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). Local similarity measures are generally preferred for database searches, where cDNAs may be compared with partially sequenced genes, and where distantly related proteins may share only isolated regions of similarity, e.g. in the vicinity of an active site.

    Many similarity measures, including the one we employ, begin with a matrix of similarity scores for all possible pairs of residues. Identities and conservative replacements have positive scores, while unlikely replacements have negative scores. For amino acid sequence comparisons we generally use the PAM-120 matrix (a variation of that of Dayhoff et al., 1978), while for DNA sequence comparisons we score identities +5, and mismatches -4; other scores are of course possible. A sequence segment is a contiguous stretch of residues of any length, and the similarity score for two aligned segments of the same length is the sum of the similarity values for each pair of aligned residues.

    Given these rules, we define a maximal segment pair (MSP) to be the highest scoring pair of identical length segments chosen from 2 sequences. The boundaries of an MSP are chosen to maximize [...]
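The MSP definition above can be made concrete with a small exhaustive computation. The following is a minimal sketch, not the BLAST heuristic itself: it assumes the DNA scoring mentioned in the text (+5 for an identity, -4 for a mismatch) and finds the exact MSP score by scanning every diagonal of the comparison with a maximum-subarray pass. The function name `msp_score` is introduced here purely for illustration.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Return the exact maximal segment pair (MSP) score of sequences a and b.

    A segment pair is two equal-length contiguous segments, one from each
    sequence, aligned without gaps. Every such pair lies on one diagonal of
    the comparison matrix, so a maximum-subarray scan (Kadane's algorithm)
    along each diagonal finds the best one.
    """
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        # Start of the diagonal where i - j == offset.
        i, j = (offset, 0) if offset >= 0 else (0, -offset)
        running = 0
        while i < len(a) and j < len(b):
            running += match if a[i] == b[j] else mismatch
            if running < 0:
                running = 0  # a negative prefix can never help a later segment
            best = max(best, running)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    print(msp_score("AAAATAAAA", "AAAAGAAAA"))
```

This exhaustive scan takes time proportional to the product of the sequence lengths, which is the cost a faster approximation would aim to avoid.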

    [...] word can be used as an index into an array of size 20^4 = 160,000. Let the ith entry of such an array point to the list of all occurrences in the query sequence of the ith word. Thus, as we scan the database, each database word leads us immediately to the corresponding hits. Typically, only a few thousand of the 20^4 possible words will be in this table, and it is easy to modify the approach to use far fewer than 20^4 pointers.
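The table lookup described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the ordering of the 20-letter alphabet is arbitrary, a Python dict stands in for the sparse variant that uses "far fewer than 20^4 pointers", and the names `word_to_index`, `build_query_table` and `scan` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residues; this ordering is an assumption
CODE = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def word_to_index(word):
    """Encode a word as a base-20 integer in the range [0, 20**len(word))."""
    index = 0
    for residue in word:
        index = index * 20 + CODE[residue]
    return index


def build_query_table(query, w=4):
    """Map each word index to the list of query offsets where that word occurs."""
    table = {}
    for pos in range(len(query) - w + 1):
        table.setdefault(word_to_index(query[pos:pos + w]), []).append(pos)
    return table


def scan(database_seq, table, w=4):
    """Yield (query_offset, database_offset) for every word hit."""
    for dpos in range(len(database_seq) - w + 1):
        key = word_to_index(database_seq[dpos:dpos + w])
        for qpos in table.get(key, ()):
            yield qpos, dpos


if __name__ == "__main__":
    table = build_query_table("ACDEACDE")
    print(list(scan("WWACDEWW", table)))
```

A dense array of 160,000 entries would give the same behaviour with a single indexing operation per database word; the dict is used here only to keep the sketch short.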

    The second approach we explored for the scanning phase was the use of a deterministic finite automaton or finite state machine (Mealy, 1955; Hopcroft & Ullman, 1979). An important feature of our construction was to signal acceptance on transitions (Mealy paradigm) as opposed to on states (Moore paradigm). In the automaton's construction, this saved a factor in space and time roughly proportional to the size of the underlying alphabet. This method yielded a program that ran faster and we prefer this approach for general use. With typical query lengths and parameter settings, this version of BLAST scans a protein database at approximately 500,000 residues/s.

    Extending a hit to find a locally maximal segment pair containing that hit is straightforward. To economize time, we terminate the process of extending in one direction when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions. This introduces a further departure from the ideal of finding guaranteed MSPs, but the added inaccuracy is negligible, as can be demonstrated by both experiment and analysis (e.g. for protein comparisons the default distance is 20, and the probability of missing a higher scoring extension is about 0.001).

    For DNA, we use a simpler word list, i.e. the list of all contiguous w-mers in the query sequence, often with w = 12. Thus, a query sequence of length n yields a list of n - w + 1 words, and again there are commonly a few thousand words in the list. It is advantageous to compress the database by packing 4 nucleotides into a single byte, using an auxiliary table to delimit the boundaries between adjacent sequences. Assuming w ≥ 11, each hit must contain an 8-mer hit that lies on a byte boundary. This observation allows us to scan the database byte-wise and thereby increase speed 4-fold. For each 8-mer hit, we check for an enclosing w-mer hit; if found, we extend as before. Running on a SUN4, with a query of typical length (e.g. several thousand bases), BLAST scans at approximately 2 × 10^6 bases/s. At facilities which run many such searches a day, loading the compressed database into memory once in a shared memory scheme affords a substantial saving in subsequent search times.

    It should be noted that DNA sequences are highly non-random, with locally biased base composition (e.g. A+T-rich regions), and repeated sequence elements (e.g. Alu sequences) and this has important consequences for the design of a DNA database search tool.
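The hit-extension rule described above (stop extending in one direction once the running score falls a fixed distance below the best score seen so far) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: it uses the +5/-4 DNA scores for brevity, and the names `dna_score`, `_extend` and `extend_hit` as well as the exact stopping comparison (`>=`) are choices made here for illustration.

```python
def dna_score(x, y):
    """Toy DNA scoring: +5 for an identity, -4 for a mismatch."""
    return 5 if x == y else -4


def _extend(a, b, i, j, step, score, drop):
    """Walk from (i, j) in direction step (+1 or -1) without gaps.

    Returns (best score gain, number of residues kept to reach it). The walk
    stops at a sequence end or once the running score is at least `drop`
    below the best score found for a shorter extension.
    """
    best = running = 0
    kept = steps = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        running += score(a[i], b[j])
        steps += 1
        if running > best:
            best, kept = running, steps
        elif best - running >= drop:
            break
        i += step
        j += step
    return best, kept


def extend_hit(a, b, i, j, w, score=dna_score, drop=20):
    """Extend the word hit a[i:i+w] / b[j:j+w] in both directions.

    Returns (segment pair score, start in a, start in b, length).
    """
    seed = sum(score(a[i + k], b[j + k]) for k in range(w))
    right, r_len = _extend(a, b, i + w, j + w, +1, score, drop)
    left, l_len = _extend(a, b, i - 1, j - 1, -1, score, drop)
    return seed + left + right, i - l_len, j - l_len, l_len + w + r_len


if __name__ == "__main__":
    print(extend_hit("ACGTTACGT", "ACGTGACGT", 0, 0, 4))
```

With a small `drop`, a single mismatch ends the extension early, which is the trade-off the text describes: less time spent extending, at the price of occasionally missing a higher scoring segment pair.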
    If a given query sequence has, for example, an A+T-rich subsequence, or a commonly occurring repetitive element, then a database search will produce a copious output of matches with little interest. We have designed [...] constitutes an improvement. We also implemented the alternative of making a table of all occurrences of the w-mers in the database, then scanning the query sequence and processing hits. The disk space requirements are considerable, approximately 2 computer words for every residue in the database. More damaging was that, for query sequences of typical length, the need for random access into the database (as opposed to sequential access) made the approach [...]

    [...] parameters λ and K for evaluating the statistical significance of MSP scores. When two random sequences of lengths m and n are compared, the probability of finding a segment pair with a score greater than or equal to S is:

        1 - e^{-y},    (1)

    where y = Kmn e^{-λS}. More generally, the probability of finding c or more distinct segment pairs, all with a score of at least S, is given by the formula:

        1 - e^{-y} \sum_{i=0}^{c-1} \frac{y^i}{i!}.    (2)

    Using this formula, two sequences that have several distinct regions of similarity can sometimes be detected as significantly related, even when no segment pair is statistically significant in isolation.
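The two significance formulas labelled (1) and (2) can be evaluated numerically as in the sketch below. It assumes the Poisson form in which (1) is 1 - e^{-y} and (2) is one minus the first c terms of a Poisson distribution with mean y = Kmn e^{-λS}. The function names are introduced here for illustration, and the values of K and λ in the usage line are placeholders, not parameters from the paper (they depend on the scoring matrix and residue frequencies).

```python
import math


def expected_y(score, m, n, K, lam):
    """y = K * m * n * exp(-lambda * S) for sequences of lengths m and n."""
    return K * m * n * math.exp(-lam * score)


def msp_p_value(score, m, n, K, lam):
    """Probability of a segment pair scoring at least `score`: 1 - e^{-y}."""
    y = expected_y(score, m, n, K, lam)
    return -math.expm1(-y)  # equals 1 - e^{-y}, stable when y is tiny


def multi_msp_p_value(score, c, m, n, K, lam):
    """Probability of c or more distinct segment pairs, all scoring >= `score`."""
    y = expected_y(score, m, n, K, lam)
    poisson_head = sum(y ** i / math.factorial(i) for i in range(c))
    return 1.0 - math.exp(-y) * poisson_head


if __name__ == "__main__":
    # K and lam below are illustrative placeholders only.
    print(msp_p_value(70, 250, 4_000_000, K=0.1, lam=0.3))
```

Setting c = 1 in the second function reduces it to the first, which is a quick consistency check on the two formulas.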

    Figure 1. The probability q of BLAST missing [...]

    Table 1
    The probability of a hit at various settings of the parameters w and T, and the proportion of random MSPs missed by BLAST

    [Table body not recoverable from the source. Legible column headings: probability of a hit × 10^5; linear regression -ln(q) = aS + b; expected no. of random MSPs with score at least S.]

    [...] chance of a hit. Examining Table 1, it is apparent that the parameter pairs (w = 3, T = 14), (w = 4, T = 16) and (w = 5, T = 18) all have approximately equivalent sensitivity over the relevant range of cutoff scores. The probability of a hit yielded by these parameter pairs is seen to decrease for increasing w; the same also holds for different levels of sensitivity. This makes intuitive sense, for the longer the word pair examined the more information gained about potential MSPs. Maintaining a given level of sensitivity, we can therefore decrease the time spent on step (3), above, by increasing the parameter w. However, there are complementary problems created by large w. For proteins there are 20^w possible words of length w, and for a given level of sensitivity the number of words generated by a query grows exponentially with w. (For example, using the 3 parameter pairs above, a 30 residue sequence was found to generate word lists of size 296, 3561 and 40,939 respectively.) This increases the time spent on step (1), and the amount of memory required. In practice, we have found that for protein searches the best compromise between these considerations is with a word size of four; this is the parameter setting we use in all analyses that follow.

    Although reducing the threshold T improves the approximation of MSP scores by BLAST, it also increases execution time because there will be more words generated by the query sequence and therefore more hits. What value of T provides a reasonable compromise between the considerations of sensitivity and time? To provide numerical data, we compared a random 250 residue sequence against the entire PIR database (Release 23.0, 14,372 entries and 3,977,903 residues) with T ranging from 20 to 13. In Figure 2 we plot the execution time (user time on a SUN4-280) versus the number of

    Figure 2. The central processing unit time required to execute BLAST on the PIR protein database (Release 23.0) as a function of the size of the word list generated. Points correspond to values of the threshold parameter T ranging from 13 to 20. Greater values of T imply fewer words in the list.
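The dependence of word-list size on the threshold T, which Figure 2 relates to execution time, can be illustrated with a brute-force sketch. It assumes that the word list consists of every length-w word scoring at least T when aligned with some length-w word of the query. To stay small it uses a 4-letter alphabet and a toy +5/-4 score rather than the 20-letter alphabet and PAM-120 matrix used for proteins; `toy_score` and `neighborhood_words` are hypothetical names.

```python
from itertools import product


def toy_score(x, y):
    """Toy substitution score (an assumption): +5 identity, -4 mismatch."""
    return 5 if x == y else -4


def neighborhood_words(query, w, T, alphabet="ACGT", score=toy_score):
    """Return all length-w words over `alphabet` that score at least T
    against at least one length-w word of `query` (brute-force enumeration)."""
    query_words = {query[i:i + w] for i in range(len(query) - w + 1)}
    words = set()
    for candidate in product(alphabet, repeat=w):
        for qword in query_words:
            if sum(score(x, y) for x, y in zip(candidate, qword)) >= T:
                words.add("".join(candidate))
                break
    return words


if __name__ == "__main__":
    for T in (15, 6):
        print(T, len(neighborhood_words("ACGT", w=3, T=T)))
```

Enumerating all alphabet^w candidates is only feasible for toy sizes; it is used here to make the effect of lowering T visible, not as a practical way to build the list.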

    words generated for each value of T. Although there is a linear relationship between the number of words generated and execution time, the number of words generated increases exponentially with decreasing T over this range (as seen by the spacing of x values). This plot and a simple analysis reveal that the expected-time computational complexity of BLAST is approximately aW + bN + cNW/20^w, where W is the number of words generated, N is the number of residues in the database and a, b and c are constants. The W term accounts for compiling the word list, the N term covers the database scan and the NW term is for extending the hits. Although the number of words generated, W, increases exponentially with decreasing T, it increases only linearly with the length of the query, so that doubling the query length doubles the number of words. We have found in practice that T = 17 is a good choice for the threshold because, as discussed below, lowering the parameter further provides little improvement in the detection of actual homologies.

    BLAST's direct tradeoff between accuracy and speed is best illustrated by Table 2. Given a specific probability q of missing a chance MSP with score S, one can calculate what threshold parameter T is required, and therefore the approximate execution time. Combining the data of Table 1 and Figure 2, Table 2 shows the central processing unit times required (for various values of q and S) to search the current PIR database with a random query sequence of length 250. To have about a 10% chance of missing an MSP with the statistically significant score of 70 requires about nine seconds of central processing unit time. To reduce the chance of missing such an MSP to 2% involves lowering T, thereby doubling the execution time. Table 2 illustrates, furthermore, that the higher scoring (and more statistically significant) an MSP, the less time is required to find it with a given degree of certainty.

    Table 2
    The central processing unit time required to execute BLAST as a function of the approximate probability q of missing an MSP with score S

    q (%)      CPU time (s)
    2          39      25      17      12
    5          25      17      12      9
    10         17      12      9       7
    20         12      9       7       5

    S          44      55      70      90
    p-value    1.0     0.8     0.01    10^-5

    Times are for searching the PIR database (Release 23.0) with a random query sequence of length 250 using a SUN4-280. CPU, central processing unit.
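The expected-time expression aW + bN + cNW/20^w quoted above can be evaluated term by term as in this small sketch. The constants a, b and c are machine-dependent and are not given in the text, so the values used below are placeholders chosen only to show how the three terms combine; `blast_time_terms` is a hypothetical name.

```python
def blast_time_terms(W, N, w, a, b, c):
    """Return the three terms of the cost model a*W + b*N + c*N*W / 20**w.

    W: number of words generated, N: residues in the database,
    w: word length; a, b, c: machine-dependent constants (assumed here).
    """
    compile_words = a * W              # building the word list
    scan_database = b * N              # one pass over the database
    extend_hits = c * N * W / 20 ** w  # expected work extending hits
    return compile_words, scan_database, extend_hits


if __name__ == "__main__":
    # Placeholder constants; only the structure of the formula is from the text.
    terms = blast_time_terms(W=1000, N=4_000_000, w=4, a=1e-3, b=1e-6, c=1e-4)
    print(terms, sum(terms))
```

Because W enters the first and third terms linearly, doubling the query length (and hence W) doubles those two terms while leaving the database scan term unchanged.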

    (c) Performance of BLAST with homologous sequences

    To study the performance of BLAST on real data, we compared a variety of proteins with other members of their respective superfamilies (Dayhoff, 1978), computing the true MSP scores as well as the BLAST approximation with word length four and various settings of the parameter T. Only with superfamilies containing many distantly related proteins could we obtain results usefully comparable with the random model of the previous section. Searching the globins with woolly monkey myoglobin (PIR code MYMQW), we found 178 sequences containing MSPs with scores between 50 and 80. Using word length four and T parameter 17, the random model suggests BLAST should miss about 24 of these MSPs; in fact, it misses 43. This poorer than expected performance is due to the uniform pattern of conservation in the globins, resulting in a relatively small number of high-scoring words between distantly related proteins. A contrary example was provided by comparing the mouse immunoglobulin κ chain precursor V region (PIR code KVMST1) with immunoglobulin sequences, using the same parameters as previously. Of the 33 MSPs with scores between 45 and 65, BLAST missed only two; the random model suggests it should have missed eight. In general, the distribution of mutations along sequences has been shown to be more clustered than predicted by a Poisson process (Uzzell & Corbin, 1971), and thus the BLAST approximation should, on average, perform better on real sequences than predicted by the random model.

    BLAST's great utility is for finding high-scoring MSPs quickly. In the examples above, the algorithm found all but one of the 89 globin MSPs with a score over 80, and all of the 125 immunoglobulin MSPs with a score over 50. The overall performance of BLAST depends upon the distribution of MSP scores for those sequences related to the query. In many instances, the bulk of the MSPs that are distinguishable from chance have a high enough score to be found readily by BLAST, even using relatively high values of the T parameter. Table 3 shows the number of MSPs with a score above a given threshold found by BLAST when searching a variety of superfamilies using a variety of T parameters. In each instance, the threshold S is chosen to include scores in the borderline region, which in a full database search would include chance similarities as well as biologically significant relationships. Even with T equal to 18, virtually all the statistically significant MSPs are found in most instances.

    Comparing BLAST (with parameters w = 4, T = 17) to the widely used FASTP program (Lipman & Pearson, 1985; Pearson & Lipman, 1988) in its most sensitive mode (ktup = 1), we have found that BLAST is of comparable sensitivity, generally yields fewer false positives (high-scoring but unrelated matches to the query), and is over an order of magnitude faster.

    (d) Comparison of two long DNA sequences

    Sequence data exist for a 73,360 bp section of the human genome containing the β-like globin gene

    cluster and for a corresponding 44,595 bp section of the rabbit genome (Margot et al., 1989). The pair exhibits three main classes of locally similar regions, namely genes, long interspersed repeats and certain anticipated weaker similarities, as described below. We used the BLAST algorithm to locate locally similar regions that can be aligned without introduction of gaps.

    The human gene cluster contains six globin genes, denoted ε, Gγ, Aγ, η, δ and β, while the rabbit cluster has only four, namely ε, γ, δ and β. (Actually, rabbit δ is a pseudogene.) Each of the 24 gene pairs, one human gene and one rabbit gene, constitutes a similar pair. An alignment of such a pair requires insertion and deletions, since the three exons of one gene generally differ somewhat in their lengths from the corresponding exons of the paired gene, and there are even more extensive variations among the introns. Thus, a collection of the highest scoring alignments between similar regions can be expected to have at least 24 alignments between gene pairs.

    Mammalian genomes contain large numbers of long interspersed repeat sequences, abbreviated LINES. In particular, the human β-like globin cluster contains two overlapped L1 sequences (a type of LINE) and the rabbit cluster has two tandem L1 sequences in the same orientation, both around 6000 bp in length. These human and rabbit L1 sequences are quite similar and their lengths make them highly visible in similarity computations. In all, eight L1 sequences have been cited in the human cluster and five in the rabbit cluster, but because of their reduced length and/or reversed orientation, the other published L1 sequences do not affect the results discussed below. Very recently, another piece of an L1 sequence has been discovered in the rabbit cluster (Huang et al., 1990).

    Evolution theory suggests that an ancestral gene cluster arranged as 5'-ε-γ-η-δ-β-3' may have existed before the mammalian radiation. Consistent with this hypothesis, there are inter-gene similarities within the β clusters. For example, there is a region between human ε and Gγ that is similar to a region between rabbit ε and γ.

    We applied a variant of the BLAST program to these two sequences, with match score 5, mismatch score -4 and, initially, w = 12. The program found 98 alignments scoring over 200, with 1301 being the highest score. Of the 57 alignments scoring over 350, 45 paired genes (with each of the 24 possible gene pairs represented) and the remaining 12 involved L1 sequences. Below 350, inter-gene similarities (as described above) appear, along with additional alignments of genes and of L1 sequences. Two alignments with scores between 200 and 350 do not fit the anticipated pattern. One reveals the newly discovered section of L1 sequence. The other aligns a region immediately 5' from the human β gene with a region just 5' from rabbit δ. This last alignment may be the result of an intrachromosomal gene conversion between δ and β in the rabbit genome (Hardison & Margot, 1984).

    With smaller values of w, more alignments are found. In particular, with w = 8, an additional 32 alignments are found with a score above 200. All of these fall in one of the three classes discussed above. Thus, use of a smaller w provides no essentially new information. The dependence of various values on w is given in Table 4. Time is measured in seconds on a SUN4 for a simple variant of BLAST that works with uncompressed DNA sequences.

    Table 3
    The number of MSPs found by BLAST when searching various protein superfamilies in the PIR database (Release 22.0)

                                                  Number of MSPs with score at least S      Number of MSPs in
    PIR code of      Superfamily       Cutoff     found by BLAST with T parameter set to    superfamily with
    query sequence   searched          score S    22    20    19    18    17    16    15    score at least S
    MYMQW            Globin            47         115   169   178   222   238   255   281   285
    KVMST1           Immunoglobulin    47         153   155   155   156   156   157   158   158
    OKBOG            Protein kinase    52         9     42    47    59    60    60    60    60
    ITHU             Serpin            50         12    12    12    12    12    12    12    12
    KYBOA            Serine protease   49         59    59    59    59    59    59    59    59
    CCHU             Cytochrome c      46         81    91    91    96    98    98    98    98
    FECF             Ferredoxin        44         22    22    23    24    24    24    24    24

    MYMQW, woolly monkey myoglobin; KVMST1, mouse Ig κ chain precursor V region; OKBOG, bovine cGMP-dependent protein kinase; ITHU, human α-1-antitrypsin precursor; KYBOA, bovine chymotrypsinogen A; CCHU, human cytochrome c; FECF, Chlorobium sp. ferredoxin.

    Table 4
    The time and sensitivity of BLAST on DNA sequences as a function of w

    w      Time    Words     Hits       Matches
    8      15.9    44,587    118,941    130
    9      6.8     44,586    39,218     123
    10     4.3     44,585    15,321     114
    11     3.5     44,584    7345       106
    12     3.2     44,583    4197       98
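The Words and Hits columns of Table 4 reflect two simple counts: a query of length n contributes n - w + 1 words, and a hit is an exact w-mer match between the query and the other sequence. The sketch below reproduces those two counts on toy strings; it is an illustration with a hypothetical function name, not the program used to produce the table, and it does not extend or score the hits.

```python
from collections import defaultdict


def count_words_and_hits(query, subject, w):
    """Return (number of query words, number of exact w-mer hits).

    A hit is a pair (query offset, subject offset) whose w-mers are identical.
    """
    positions = defaultdict(list)
    for i in range(len(query) - w + 1):
        positions[query[i:i + w]].append(i)
    n_words = len(query) - w + 1
    n_hits = sum(
        len(positions.get(subject[j:j + w], ()))
        for j in range(len(subject) - w + 1)
    )
    return n_words, n_hits


if __name__ == "__main__":
    for w in (4, 6, 8):
        print(w, count_words_and_hits("ACGTACGTAC", "TTACGTACTT", w))
```

As in the table, the word count barely changes with w while the number of hits drops quickly as w grows, which is why larger w shortens the search.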

    4. Conclusion

    The concept underlying BLAST is simple and robust and therefore can be implemented in a number of ways and utilized in a variety of contexts. As mentioned above, one variation is to allow for gaps in the extension step. For the applications we have had in mind, the tradeoff in speed proved unacceptable, but this may not be true for other applications. We have implemented a shared memory version of BLAST that loads the compressed DNA file into memory once, allowing subsequent searches to skip this step. We [...]

    References

    [...] Found., Washington, DC.
    Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). In Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.), vol. 5, suppl. 3, pp. 345-352, Nat. Biomed. Res. Found., Washington, DC.
    Dembo, A. & Karlin, S. (1991). Ann. Prob. in the press.
    Goad, W. B. & Kanehisa, M. I. (1982). Nucl. Acids Res. 10, 247-263.
    Gotoh, O. & Tagashira, Y. (1986). Nucl. Acids Res. 14, 57-64.
    Hardison, R. C. & Margot, J. B. (1984). Mol. Biol. Evol. 1, 302-316.
    Hopcroft, J. E. & Ullman, J. D. (1979). In Introduction to Automata Theory, Languages, and Computation, pp. 42-45, Addison-Wesley, Reading, MA.
    Huang, X., Hardison, R. C. & Miller, W. (1990). Comput. Appl. Biosci. In the press.
    Karlin, S. & Altschul, S. F. (1990). Proc. Nat. Acad. Sci., U.S.A. 87, 2264-2268.
    Karlin, S., Dembo, A. & Kawabata, T. (1990). Ann. Stat. 18, 571-581.
    Lipman, D. J. & Pearson, W. R. (1985). Science, 227, 1435-1441.
    Margot, J. B., Demers, G. W. & Hardison, R. C. (1989). J. Mol. Biol. 205, 15-40.
    Mealy, G. H. (1955). Bell System Tech. J. 34, 1045-1079.
    Needleman, S. B. & Wunsch, C. D. (1970). J. Mol. Biol. 48, 443-453.
    Pearson, W. R. & Lipman, D. J. (1988). Proc. Nat. Acad. Sci., U.S.A. 85, 2444-2448.
    Sankoff, D. & Kruskal, J. B. (1983). Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA.
    Sellers, P. H. (1974). SIAM J. Appl. Math. 26, 787-793.
    Sellers, P. H. (1984). Bull. Math. Biol. 46, 501-514.
    Smith, R. F. & Smith, T. F. (1990). Proc. Nat. Acad. Sci., U.S.A. 87, 118-122.
    Smith, T. F. & Waterman, M. S. (1981). Advan. Appl. Math. 2, 482-489.
    Uzzell, T. & Corbin, K. W. (1971). Science, 172, 1089-1096.
    Waterman, M. S. (1984). Bull. Math. Biol. 46, 473-500.

    Edited by S. Brenner