Sequence Comparison: Pairwise Alignment
Shifra Ben-Dor
Irit Orr
The problems:
I have a DNA sequence: What does it do?
• possible coding region
• possible regulatory region
I have a protein sequence:
What does it do?
Sequence Comparison
• Generally, sequence determines structure and structure determines function.
• By studying sequence similarity, we hope to find correlations between our sequence and other sequences with known structure or function.
• This approach is often successful; however, many molecules have low sequence similarity yet still share similar structure or function.
Sequence Comparison
• Motifs/Domains - similarity over small stretches
• Sequence families - similarity over longer sequences
• Comparison can help us with:
  • structure
  • function
  • evolution
Comparison Questions:
• Are the sequences related (homology)?
• Can we qualify their similarity?
• Do they have similar segments?
Terminology:
• Homology
• Identity
• Similarity
Homology
• Common ancestry
• Sequence (and usually structure) conservation
• Homology is not a measurable quantity
• Homology can be inferred, under suitable conditions
Identity
• Objective and well defined
• Can be quantified by several methods:
  • Percent
  • The number of identical matches divided by the length of the aligned region
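The percent identity described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: the function name `percent_identity` and the use of `.` as the gap character are assumptions, and gap columns are counted as part of the aligned region (some tools exclude them).

```python
def percent_identity(top, bottom, gap_char="."):
    """Percent of aligned columns that hold the same residue in both rows.

    Assumption: both rows come from one alignment, so they have equal length,
    and gap columns count toward the length of the aligned region.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    if not top:
        raise ValueError("alignment is empty")
    # A column is identical when both residues match and neither is a gap.
    identical = sum(1 for x, y in zip(top, bottom) if x == y and x != gap_char)
    return 100.0 * identical / len(top)


print(round(percent_identity("VITKLGTCVGS", "VIT...TCVGS"), 1))  # 8 of 11 columns
```

Changing the denominator (for example, to the length of the shorter sequence) changes the reported value, which is why the definition used should always be stated alongside the percentage.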
Similarity
• Most common method used
• Not so well defined
• Depends on the parameters used (alphabet, scoring matrix, etc.)
What are we comparing?
• DNA or RNA
  • Four nucleotides (basic set)
• Protein
  • Twenty amino acids (basic set)
Alignment
• An alignment is an arrangement of two sequences opposite one another
• It shows where they are different and where they are similar
• We want to find the optimal alignment - the most similarity and the fewest differences
Alignment
• Alignments have two aspects:
  • Quantity: To what degree are the sequences similar (percentage, other scoring method)
  • Quality: Regions of similarity in a given sequence
The optimal alignment of two sequences is one that finds the longest segment of high sequence similarity.
How is an alignment done?
• When we compare sequences, we take two strings of letters (nucleotides or amino acids) and align them.
• Where the characters are identical, we give them a positive score, and where they differ, a negative value.
• We count the identical and non-identical characters, and give the alignment a score (usually called the quality).
Differences in the sequence can be caused by deletions or insertions in the DNA, or by point mutations. These changes can be seen at the protein level as well (changes in the translation of the protein).
This scheme works fine as long as you assume that all possible mutations occur at the same frequency.
However, nature doesn't work this way. It has been found that in DNA, transitions occur more often than transversions.
Purines (A, G) are 2-ring bases
Pyrimidines (C, T) are 1-ring bases
Transition: purine to purine or pyrimidine to pyrimidine
Transversion: purine to pyrimidine or pyrimidine to purine
Transitions conserve ring number
Transversions change ring number
(Figure taken from Molecular Cell Biology, Darnell, Lodish, Baltimore, 1990)
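The transition/transversion rule above can be expressed directly in code. The sketch below assumes DNA bases only (A, C, G, T); the function name `classify_substitution` is hypothetical and chosen for illustration.

```python
PURINES = {"A", "G"}        # 2-ring bases
PYRIMIDINES = {"C", "T"}    # 1-ring bases


def classify_substitution(ref, alt):
    """Label a single-base change as identical, transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if ref == alt:
        return "identical"
    # Same ring class on both sides means the ring number is conserved.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```

A nucleotide scoring scheme that reflects the observation above would penalize the "transversion" case more heavily than the "transition" case.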
For proteins, the situation is far more complex
• Amino acids can be grouped by a number of classifications:
  • Chemical: aromatic, aliphatic, sulfur-containing
  • Functional: hydrophobic, hydrophilic, acidic, basic
  • Charge: positive, negative, neutral
  • Structural: internal, external
Scoring Matrices
• Scoring matrices are used to assign a score to each comparison of a pair of characters
• The scores in the matrix are integer values which assign a positive score to identical or similar character pairs, and a negative value to dissimilar pairs
• The matrices were constructed by analyzing known families of proteins
An example: Blosum62 (Henikoff & Henikoff)
   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A  4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C  0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G  0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
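To show how a scoring matrix is used, the sketch below scores two equal-length peptides position by position without gaps. Only a small subset of the matrix (residues I, L, V, K, R, with values copied from the table above) is included to keep the example short; the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are hypothetical.

```python
# Subset of the Blosum62 values shown in the table (I, L, V, K, R only).
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4, ("K", "K"): 5, ("R", "R"): 5,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1, ("K", "R"): 2,
    ("I", "K"): -3, ("I", "R"): -3, ("L", "K"): -2, ("L", "R"): -2,
    ("V", "K"): -2, ("V", "R"): -3,
}


def pair_score(x, y):
    """Look up one residue pair; the matrix is symmetric, so try both orders."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def ungapped_score(seq1, seq2):
    """Sum of pair scores for two equal-length sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


print(ungapped_score("ILVK", "VLIR"))  # 3 + 4 + 3 + 2 = 12
```

Note that the two peptides share only one identical position, yet the score is clearly positive because the substitutions are between chemically similar residues. This is the difference between identity and similarity.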
Alignment algorithms
• Visual alignment
  • allows integration of relevant data not available to computerized algorithms
  • Time consuming, feasible only for the shortest sequences
• Fixed length algorithms
  • do not consider insertions and deletions
  • insertions and deletions are needed even for closely related sequences
Alignment Algorithms
• The naïve approach:
  • generate all possible alignments for 2 sequences (including gaps) and choose the alignment with the highest score
  • Too time consuming
Dynamic programming algorithms
• Each character along both sequences is evaluated. At each position there are four possibilities:
  • identity
  • substitution
  • deletion in sequence 1
  • deletion in sequence 2
Dynamic programming
• Identical characters (matches) or substitutions (mismatches) are scored according to a matrix.
• Deletions in either of the sequences are called gaps.
• Gaps are given a negative score, referred to as the gap penalty
The alignment is given a score, called the quality
Quality = matches - (mismatches + gap penalty)
The program will find the alignment with the highest quality.
The choice between gaps and substitutions is made to give the higher quality of the two.
The Gap Penalty
Consider the two following alignments:
V I T K L G T C V G S        V I T K L G T C V G S
V I T . . . T C V G S        V . T K . G T C V . S
According to the algorithm these 2 cases will get the same gap penalty:
Match = 3, Gap = -2
8(3) + 3(-2) = 18            8(3) + 3(-2) = 18
However, nature is different. In most cases insertions/deletions are longer than a single residue, even for very similar sequences.
To compensate for this, and to differentiate between cases like the one above, the gap penalty is made up of two factors:
The gap creation penalty - subtracted from the alignment quality whenever a gap is opened.
The gap extension penalty - subtracted from the alignment quality according to the length of the gap.
Thus we have:
Quality = matches - (mismatches + gap penalty)
Gap penalty = gap creation penalty + (gap extension penalty × gap length)
The Gap Penalty
So now we have:
V I T K L G T C V G S        V I T K L G T C V G S
V I T . . . T C V G S        V . T K . G T C V . S
Match = 3, Gap open = -4, Gap extension = -1
8(3) + [1(-4) + 3(-1)] = 17  8(3) + [3(-4) + 3(-1)] = 9
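The arithmetic of the two gap penalty schemes can be checked with a short script. This is a sketch under stated assumptions: the function name `alignment_quality` is hypothetical, `.` marks a gap, and the mismatch score of -1 is an arbitrary placeholder (the example alignments contain no mismatches, so it does not affect the results).

```python
def alignment_quality(top, bottom, match=3, mismatch=-1,
                      gap_open=-4, gap_extend=-1, gap_char="."):
    """Score a finished alignment with a creation + extension gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    quality = 0
    prev_gap_row = None  # which row carried a gap in the previous column
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            row = "top" if x == gap_char else "bottom"
            if row != prev_gap_row:
                quality += gap_open      # a new gap is opened
            quality += gap_extend        # charged once per gap position
            prev_gap_row = row
        else:
            quality += match if x == y else mismatch
            prev_gap_row = None
    return quality


one_long_gap = ("VITKLGTCVGS", "VIT...TCVGS")
three_short_gaps = ("VITKLGTCVGS", "V.TK.GTCV.S")

print(alignment_quality(*one_long_gap))      # 17
print(alignment_quality(*three_short_gaps))  # 9
# With no creation penalty and -2 per gap position, both score 18.
print(alignment_quality(*one_long_gap, gap_open=0, gap_extend=-2))
print(alignment_quality(*three_short_gaps, gap_open=0, gap_extend=-2))
```

Setting `gap_open=0` reproduces the single-penalty scheme, which cannot tell the two alignments apart; the two-part penalty favors the alignment with one long gap.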
Gap penalty parameters
Insertion of a gap must improve the quality of the alignment (raise the quality score).
If the gap creation and gap extension penalties are high, fewer gaps will be inserted into the alignment.
If the gap creation and gap extension penalties are low, more gaps will be inserted into the alignment.
So if you are interested in an alignment between two very similar sequences, the gap penalties should be raised, to reduce the chances of getting something random.
If you are interested in detecting homology (finding a weak similarity) between two distantly related sequences, the gap penalties should be lowered.
If you don't know what to expect, start off with the default parameters.
To summarize:
Alignment scores are dependent on what we choose for: matches, mismatches, substitutions and gaps.
Dynamic programming can be used for global or local alignment
Two types of alignment:
• Global alignment
• Local alignment
Global alignment
A global pairwise alignment is one where it is assumed that the two sequences have diverged from a common ancestor and that the program should try to stretch the two sequences, introducing gaps where necessary, in order to show the alignment over the whole length of the two sequences that best illustrates their similarities.
Global alignment
• Compares sequences and gives best overall alignment
• May fail to find the best local region of similarity (such as a shared motif) among distantly related sequences
• Will (generally) return only the best matching segment for a given pair of sequences
Global alignment
• The classical algorithm for global alignment is the Needleman-Wunsch algorithm
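A minimal sketch of Needleman-Wunsch global alignment follows. To stay short it uses a simple match/mismatch score and a single per-position gap penalty rather than a scoring matrix and a creation/extension penalty; the mismatch value of -1 and the function name `needleman_wunsch` are assumptions made for illustration.

```python
def needleman_wunsch(a, b, match=3, mismatch=-1, gap=-2):
    """Return (quality, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best quality for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # identity/substitution
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("."); i -= 1
        else:
            top.append("."); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


quality, row1, row2 = needleman_wunsch("VITKLGTCVGS", "VITTCVGS")
print(quality)  # 18
print(row1)
print(row2)
```

Several alignments can share the best quality, so the traceback returns one of them; which one depends on the order in which the three moves are tested.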
Local Alignment
• Searches for regions of local similarity between two sequences and need not include the entire length of the sequences.
• Finds regions of (ungapped) sequence with a high degree of similarity
• Better at finding motifs, especially for sequences that are different overall
• Can return more than one matching segment for a given pair of sequences
Local Alignment
• The classical algorithm for local alignment is the Smith-Waterman algorithm
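For comparison, here is a minimal sketch of Smith-Waterman local alignment. The only changes from a global dynamic programming scheme are that cell scores are never allowed to drop below zero and that the traceback starts at the highest-scoring cell. The scoring values, the single per-position gap penalty and the function name `smith_waterman` are assumptions for illustration, and this sketch reports only the single best segment.

```python
def smith_waterman(a, b, match=3, mismatch=-1, gap=-2):
    """Return (quality, segment_of_a, segment_of_b) for the best local alignment."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment start afresh anywhere.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score returns to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("."); i -= 1
        else:
            top.append("."); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("KLTCVGSPW", "MMTCVGSQQ"))  # (15, 'TCVGS', 'TCVGS')
```

The flanking residues, which do not match, are simply left out of the result instead of being forced into the alignment as they would be in a global alignment.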
Sequence Comparison Programs
• Global
  • Gap (GCG)
  • Align (Fasta)
  • Needle (EMBOSS)
  • Stretcher (EMBOSS) - modified to conserve memory, good for long sequences
Sequence Comparison Programs
• Local
  • Bestfit (GCG)
  • Lalign (Fasta) - can return more than one segment
  • Matcher (EMBOSS) - based on lalign, can return more than one segment
  • Water (EMBOSS) - Smith-Waterman, only one hit
Local pairwise alignment using BL2SEQ at NCBI
This tool produces the alignment of two given sequences using the BLAST algorithm for local alignment.
Reference: Tatiana A. Tatusova, Thomas L. Madden (1999), "Blast 2 sequences - a new tool for comparing protein and nucleotide sequences", FEMS Microbiol Lett. 174:247-250
Local pairwise alignment using BL2SEQ
This tool utilizes the BLAST engine for pairwise sequence comparison and is based on the same algorithm and statistics of local alignments that have been described in the BLAST paper.
The BLAST algorithm generates a gapped alignment by using dynamic programming to extend the central segment of aligned residues.
Because the parameters were based on database searching, some may have to be changed to find a match.
Statistical Evaluation of Alignments
The problem with these programs is that no matter how dissimilar the sequences you compare are, the programs will always align them.
Even a 5% identity will be displayed as a valid result.
So how can you tell if the alignment is statistically valid?
The randomize option
The randomize option will take the second sequence you input and shuffle it, to obtain a random sequence with the same character composition.
This random sequence will be compared to the first sequence, using either a global or local algorithm (the same that you used originally), and a quality score will be obtained.
The randomize option
This process is repeated a number of times, specified by the user, in order to obtain a population of sequences that can be used for statistical analysis.
The quality of these alignments will be averaged and compared to the original quality, and then be used to give a statistically meaningful answer to the alignment.
The program gives the original alignment, the original quality, and a new quality score made from the average of the randomized alignments +/- the standard deviation.
These values are used to calculate the distance of your original quality from the mean (Z-score).
Z-score = (Original Quality - Average Quality) / Standard Deviation
A Z-score of over 4 is considered significant (meaning that the alignment is statistically significant too).
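The shuffle-and-rescore procedure can be sketched as follows. This is an illustration of the idea, not a reimplementation of the GCG randomize option: the scoring function is a simple global alignment with assumed parameters (match 3, mismatch -1, gap -2), the sample standard deviation is used, and the names `global_score` and `shuffle_z_score` are hypothetical.

```python
import random
import statistics


def global_score(a, b, match=3, mismatch=-1, gap=-2):
    """Quality of the best global alignment (score only, two-row DP)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_z_score(seq1, seq2, n_shuffles=100, seed=0):
    """Return (original quality, mean shuffled quality, std dev, Z-score)."""
    rng = random.Random(seed)          # fixed seed makes the run repeatable
    original = global_score(seq1, seq2)
    letters = list(seq2)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)           # same composition, random order
        shuffled_scores.append(global_score(seq1, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    z = (original - mean) / sd if sd > 0 else float("inf")
    return original, mean, sd, z


seq = "VITKLGTCVGSVITKLGTCVGS"
original, mean, sd, z = shuffle_z_score(seq, seq, n_shuffles=50, seed=1)
print(original, round(mean, 1), round(sd, 1), round(z, 1))
```

More shuffles give a more stable estimate of the mean and standard deviation, at the cost of one full alignment per shuffle.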
Alternatives to GCG
• In the FASTA package, there are two alternative programs for statistical analysis:
  • PRDF - calculates the probability of a similarity score more accurately by using a fit to an extreme value distribution.
  • PRSS - a version of PRDF that uses a rigorous Smith-Waterman calculation to score similarities
(These programs are available on the web)
Dotplots
Dotplots are two-dimensional graphs, showing a comparison of two sequences.
The two axes of the graph represent the two sequences being compared.
Every region of the sequence is compared to every region of the other sequence.
Dotplots
Dotplotting is the best way to see all of the structures in common between two sequences.
Dotplotting can also be used to view repeated structures or inverted repeats in a single sequence. This is accomplished by comparing a sequence to itself.
Dotplotting helps recognize large regions of similarity. In most cases it is not sensitive enough to see small structures.
Comparison Criteria
The match criterion can be met in two different ways:
• The window/stringency method.
• The word method.
The window/stringency method
Searches for all the places where a given number of matches (stringency) occur within a given range (window).
This method is more time-consuming, but more sensitive.
Comparisons are done according to a scoring matrix.
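A minimal sketch of the window/stringency idea is shown below. As a simplification it counts identical characters inside each window instead of summing scoring-matrix values, so it illustrates the principle and does not reproduce the output of any particular program; the function name `dotplot_window` is hypothetical.

```python
def dotplot_window(seq1, seq2, window=21, stringency=14):
    """Return (i, j) start positions where two windows share enough matches.

    Simplifying assumption: a "match" is an identical character, and the
    stringency is the minimum number of matches required within the window.
    """
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(1 for k in range(window)
                          if seq1[i + k] == seq2[j + k])
            if matches >= stringency:
                dots.append((i, j))
    return dots


# Comparing a sequence to itself reveals the repeated ACGT unit
# as off-diagonal points (0, 4) and (4, 0).
print(dotplot_window("ACGTACGT", "ACGTACGT", window=4, stringency=4))
```

Every window of one sequence is compared with every window of the other, which is why this method is the slower of the two.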
The word method
Searches for short perfect matches of a set length (words).
Must be specified on the command line (-wordsize=X, where X is the size you choose).
This method is about 1000 times faster than the window/stringency method, but is much less sensitive.
If the sequences do not contain short perfect matches then this method will find nothing.
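The word method can be sketched with a dictionary that indexes every word of the first sequence, so that each word of the second sequence is looked up directly instead of being compared against every position. The function name `word_matches` and the indexing strategy are assumptions for illustration; real programs may implement the lookup differently.

```python
from collections import defaultdict


def word_matches(seq1, seq2, word_size=6):
    """Return sorted (i, j) positions where seq1 and seq2 share an exact word."""
    # Index every word of seq1 by its start position.
    index = defaultdict(list)
    for i in range(len(seq1) - word_size + 1):
        index[seq1[i:i + word_size]].append(i)
    # Look up each word of seq2; only perfect matches produce a dot.
    dots = []
    for j in range(len(seq2) - word_size + 1):
        for i in index.get(seq2[j:j + word_size], []):
            dots.append((i, j))
    return sorted(dots)


print(word_matches("ACGTACGT", "TTACGTTT", word_size=4))
# [(0, 2), (3, 1), (4, 2)]
```

Because a single mismatch inside a word removes the dot entirely, larger word sizes run faster and give cleaner plots but miss more of the weaker similarities.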
Hints
If you have long sequences, try a word comparison first. This is much faster, and will give you an idea of what the dotplot for the more sensitive window/stringency method will look like.
When using the word method, start off with a word size of 6 for nucleic acid sequences of up to 1,000 bases, or 8 for sequences of up to 10,000.
Hints
For peptide sequences, start off with a word size of 2-3.
When using the window/stringency method, start off with a window of 21 and a stringency of 14 for nucleic acids.
For peptide sequences, start off with a window of 30 and a stringency of 11.
Programs for dotplots
GCG
  – Compare (create the points)
  – Dotplot (plot the output of compare)
SeqWeb (GCG)
  – Compare
EMBOSS
  – Dotmatcher - window/stringency
  – Dottup - word plot
  – Dotpath - non-overlapping word plot
  – Polydot - all against all word plot
Alternative "dotplots"
Dotter is a graphical dotplot program for detailed comparison of two sequences.
To make the score matrix more intelligible, the pairwise scores are averaged over a sliding window which runs diagonally. The averaged score matrix forms a three-dimensional landscape, with the two sequences in two dimensions and the height of the peaks in the third.
This landscape is projected onto two dimensions with the aid of grey scales - the darker the grey of a peak, the higher it is.
Dotter provides a tool to explore the visual appearance of this landscape, as well as a tool to examine the sequence alignment it represents.