
Basic Models of Nucleotide Evolution

Over time, nucleotides within a sequence can ‘evolve’ through substitution. This process can cause a nucleotide (T, C, A or G) to change into another nucleotide and is the main driving force behind evolution. For example, the nucleotide A in a sequence of DNA can change over time into the nucleotide C. This change may result in this sequence of DNA becoming inactive if the sequence was previously involved in protein synthesis as an exon, or may change the protein that the sequence codes for. As proteins are the building blocks of organic life, this may cause large changes in an organism’s features. Alternatively, this change may have no effect at all. On average, this form of mutation only occurs once or twice every million years. However, in assessing the evolution of species over hundreds of millions of years, models are useful in evaluating how one sequence of nucleotides may have evolved from another.

Models of nucleotide evolution can be used when examining two sequences of DNA of the same length that may be related. This type of model would be used to compare the two sequences by either assuming that one sequence evolved into the other or vice versa, or assuming that they had evolved from a common ‘ancestral’ sequence of DNA. Applying the model would give the estimated number of nucleotide substitutions per site, called the distance, which would then be used to estimate a time. This time could then relate to when one sequence evolved from the other, or to how long ago an ‘ancestral’ sequence of DNA diverged into each sequence.

In this paper, I will outline the principles and theory behind the main (most commonly used) models of nucleotide substitution, addressing each model chronologically and in some senses with increasing complexity. The models are as follows:

o Jukes and Cantor 1969 (JC69)
o Kimura 1980 (K80)
o Felsenstein 1981 (F81)
o Hasegawa, Kishino and Yano 1985 (HKY85)
o Tamura and Nei 1993 (TN93)

I will demonstrate how programming software may be used to process data using the formulae proposed within each model. From this I will explain how, continuing to use programming software, each model is capable of simulating the evolution of a nucleotide sequence over a given time.

JC69 Model

In terms of creating models that assess nucleotide substitution, the rate of substitution from one nucleotide to another and the time over which substitution has been allowed to act are key variables. Different models organise their use of rates in different ways but time is always used in the same way. The simplest model of nucleotide substitution is the Jukes and Cantor 1969 (JC69) model. This model assumes that the rate of substitution is the same between all nucleotides. Therefore, this model only requires a single parameter denoting rate, along with a value for time. A 4x4 matrix can be created showing the rates of nucleotide substitution between the 4 nucleotides. This is known as matrix Q:

Q=

Along the diagonal of this matrix, you can see that the rates of nucleotides changing into themselves are not displayed, as they are not regarded as substitutions. Also, the rows sum to 0. Using the rates in matrix Q, we can work out the probability of each nucleotide substitution occurring when t > 0, creating another matrix. This matrix is known as the transition probability matrix (P(t)) and is also a 4x4 matrix:

P(t) =

These formulae calculate the probability of one nucleotide evolving into another. They are obtained through the exponentiation of the matrix Q using the matrix Taylor series. In terms of using the matrix P(t) with real-world or experimental data, a program can be written which will calculate the transition probabilities of each nucleotide substitution using the formulae in P(t). Python is a programming language that is basic but effective, and it can be used in these circumstances. We must first define a function that will implement the formulae of the matrix P(t) when given certain values to work from. These values are called parameters and, in the case of working out the transition probabilities, we must input a value for the rate at which nucleotide substitutions will occur as well as a value for the time over which substitutions will occur.

The following code, written in Python, emulates the matrix P(t):

As shown at the bottom of the image, inputting an experimental rate (0.2) and time (1) tests the function ‘JC69’ used to calculate the transition probabilities. This is followed by a matrix displaying the probabilities row by row with nucleotide order T, C, A and G, in the same orientation as the matrix Q. In looking at the formulae used to calculate the transition probabilities, conclusions can be made as to how increasing the rate or time will affect the resultant probabilities.
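As the original code image is not reproduced here, the following is only a minimal sketch of what such a function might look like. It assumes the standard JC69 closed-form probabilities, with m as the rate of each individual substitution (the off-diagonal entry of Q); the function name jc69 and its layout are illustrative and not necessarily those used in the original program.

```python
from math import exp


def jc69(m, t):
    """Return the 4x4 JC69 transition probability matrix P(t).

    m is the single substitution rate and t the elapsed time
    (assumed parameterisation). Rows and columns follow T, C, A, G.
    """
    # Probability of ending on the same nucleotide (diagonal entries).
    same = 0.25 + 0.75 * exp(-4.0 * m * t)
    # Probability of ending on one specific different nucleotide.
    diff = 0.25 - 0.25 * exp(-4.0 * m * t)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


# Test with the experimental values quoted in the text: rate 0.2, time 1.
P = jc69(0.2, 1.0)
for row in P:
    print(["%.4f" % p for p in row])
```

Each row of the returned matrix sums to 1, since a nucleotide must end up as one of the four bases.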

The exponential (exp) of a negative value gives a decimal number smaller than 1. If the negative value increases in size, the exponential of that value becomes smaller still. Therefore, as the negative value tends to negative infinity, the exponential of that value tends to 0. In looking at the above formulae X and Y, as the values of m (rate) and t (time) increase, the values being added to ¼ in X and subtracted from ¼ in Y become vanishingly small. This results in the transition probabilities tending towards ¼ for each nucleotide substitution. This supports the assumption that over an increased time or rate, so many nucleotide substitutions would have occurred that the target nucleotide is eventually random, with a probability of ¼ for each nucleotide.

This is demonstrated in the following graph, taking increasing values for rate with a constant time of 1:
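The values behind a graph of this kind can be tabulated with a short sketch such as the one below. It assumes the standard JC69 formulae for Pii(t) and Pij(t); the list of rates is purely illustrative and is not the set of values used for the original graph.

```python
from math import exp


def jc69_probs(m, t):
    """Return (Pii, Pij) under JC69 for rate m and time t."""
    decay = exp(-4.0 * m * t)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


t = 1.0
rates = [0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]  # illustrative values only
table = [(m,) + jc69_probs(m, t) for m in rates]
for m, p_same, p_diff in table:
    print("m=%.1f  Pii=%.4f  Pij=%.4f" % (m, p_same, p_diff))
```

As the rate grows, the first probability falls from 1 and the second rises from 0, with both approaching ¼.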

Pii(t) represents the probability that a nucleotide will not experience a substitution over a period of time (t). Pij(t) represents the probability that a nucleotide will experience a substitution and evolve into another nucleotide over a period of time. At the point when time = infinity, over which a nucleotide sequence had been allowed to evolve, the proportion of nucleotides of each type (T, C, A, G) will have reached ¼ for each. This distribution of nucleotides is called the limiting distribution and, as the rates of change are the same for all nucleotides in the JC69 model, this proportion will be maintained. This proportional equilibrium is called the stationary distribution.

K80 Model

Kimura created a model proposing a more complex mix of rates between nucleotide substitutions in 1980. This model is commonly known as the K80 model and uses two rates as parameters along with time. Nucleotide substitutions can be classified as one of two types: transitions and transversions. Transitions are substitutions between nucleotides of similar molecular structure, that is, between purines or between pyrimidines, and tend to occur more frequently than other substitutions. Nucleotides A and G are purines and are substituted for each other at a higher rate, as are nucleotides T and C, which are pyrimidines. All other substitutions are transversions and are known to occur less frequently than transitions. In 1980, the first mitochondrial sequences were published showing a definitive difference between the frequencies of transitions and transversions, transitions being noticeably higher. As a result, the K80 model was developed by Kimura in response to these findings.

The rate matrix (Q) in the K80 model displays two rates: alpha (representing the substitution rates of the transitions) and beta (representing the substitution rates of the transversions). In the following representation of the matrix Q, alpha = K and beta = 1:

As with the rate matrix for the JC69 model, the diagonal elements of the matrix Q are not included, as these are not regarded as substitutions. The total substitution rate for any nucleotide would be a + 2b (K + 1 + 1). Deriving the transition probability matrix from the matrix Q is slightly more difficult than for the JC69 model. The transition probability matrix (P(t)) is as follows:

P(t) =
p0(t) p1(t) p2(t) p2(t)
p1(t) p0(t) p2(t) p2(t)
p2(t) p2(t) p0(t) p1(t)
p2(t) p2(t) p1(t) p0(t)

Where:
p0(t) = 1/4.0 + 1/4.0*exp(-4*b*t) + 1/2.0*exp(-2*(a+b)*t)
p1(t) = 1/4.0 + 1/4.0*exp(-4*b*t) - 1/2.0*exp(-2*(a+b)*t)
p2(t) = 1/4.0 - 1/4.0*exp(-4*b*t)

As with the JC69 model, we can also create a program that will emulate the transition probability matrix with relative ease by inputting the parameter values for alpha (a), beta (b) and time (t). Organising the formulae of the transition probability matrix in a similar way to the JC69 model using Python defines the following function:
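The original function was shown as an image that is not reproduced here, so the sketch below is only an illustration of how the formulae p0(t), p1(t) and p2(t) given above might be arranged in Python. The function name k80 is an assumption; the row and column order T, C, A, G follows the matrix layout above.

```python
from math import exp


def k80(a, b, t):
    """Return the 4x4 K80 transition probability matrix P(t), order T, C, A, G.

    a is the transition rate (alpha), b the transversion rate (beta),
    t the elapsed time.
    """
    p0 = 0.25 + 0.25 * exp(-4.0 * b * t) + 0.5 * exp(-2.0 * (a + b) * t)  # no change
    p1 = 0.25 + 0.25 * exp(-4.0 * b * t) - 0.5 * exp(-2.0 * (a + b) * t)  # transition
    p2 = 0.25 - 0.25 * exp(-4.0 * b * t)                                  # transversion
    return [
        [p0, p1, p2, p2],
        [p1, p0, p2, p2],
        [p2, p2, p0, p1],
        [p2, p2, p1, p0],
    ]


# Parameters quoted in the text: a = 0.4, b = 0.2, t = 1.
P = k80(0.4, 0.2, 1.0)
for row in P:
    print(["%.4f" % p for p in row])
```

Because p0 + p1 + 2*p2 = 1, every row of the matrix sums to 1 for any choice of a, b and t.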

The function is tested using the parameters a = 0.4, b = 0.2, t = 1. The transition probabilities for nucleotides experiencing no substitutions after t = 1 are high, whereas the transition probabilities for transitions and transversions are relatively low in comparison. When considering the formulae used to calculate these probabilities, certain inevitable trends are recognisable:

x represents the probability of a nucleotide experiencing no change over a given time. When t = 0, x = 1: from this point, x decreases exponentially to the value of ¼. y represents the probability of a nucleotide experiencing a transition (A <-> G or T <-> C) over a given time. At t = 0, the value of y is 0; when no time has passed, the probability of a given nucleotide experiencing any sort of substitution is 0. This is also true for transversional substitutions, represented by equation z. As time increases from 0, the transition probabilities for both transversions and transitions increase, tending towards ¼. As the rates of transitional change are higher than those of transversional change, the transition probabilities for transitional substitutions increase towards ¼ at a higher rate. The following graph represents the changes in the transition probabilities of transitions, transversions and no-substitution sites as time increases:

To create this graph, the values of alpha and beta were set to 0.4 and 0.2 respectively. These values simulate realistic values for the rates of transitions and transversions, as observed rates have shown that transitional substitutions occur at a higher frequency than transversional substitutions. Time ranges from 0 to 10, increasing by 0.1 within each interval.

HKY85 and TN93 Models

Hasegawa, Kishino and Yano developed a model in 1985 that combined elements of both the K80 and F81 models. This is known as the HKY85 model and incorporates multiple parameters to create a more realistic simulation of how nucleotide sequences behave. First of all, the HKY85 model assumes that the rates of substitution differ between each nucleotide. A single value would define the rates for a target nucleotide having been evolved into. For example, a value for the rate of T would define the rates by which any nucleotide would be substituted to result in the creation of the nucleotide T. These rates are known as base frequencies and, within this model, the base frequencies are deemed unequal. Further parameters are included to distinguish between the rates of transitions and transversions, as within the K80 model. After the first mitochondrial sequences were published in 1980, the difference between the rates of transitions and transversions was made definitive and so most nucleotide evolution models created after 1980 incorporate parameters that define the rates of transitions and transversions separately. The HKY85 model is seen to give a more accurate representation of nucleotide substitutions in comparison to the JC69, K80 and F81 models by accommodating multiple factors. The following image represents the rate matrix Q:

The matrix is organised as the rate matrices for all previous models have been: the columns and rows are in the nucleotide order T, C, A, G respectively. Within this representation of the matrix Q, K represents transitional substitutions. All other substitutions are assumed to be transversional, other than the diagonal values of the matrix, which are not substitutions. πT represents the rate of substitutions resulting in the formation of the nucleotide T, as mentioned before. πC represents the rate of substitutions resulting in the formation of the nucleotide C, and so on.

Deriving the transition probability matrix (P(t)) is not as simple as with the previous models due to the matrix Q not being a diagonal matrix. Therefore, the matrix Q is initially diagonalized, followed by the exponentiation of the diagonal to produce the matrix P(t):

Where:

Most of the transition probabilities differ for each substitution within this model; this more closely emulates how nucleotides would behave in real life in comparison to the previous models. More factors are taken into account to achieve this and so the formulae increase in complexity as they accommodate a larger number of variables. Writing a function to carry out the formulae in the transition probability matrix is slightly more time-consuming than for previous models but it is still achievable:
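The closed-form function from the original image is not reproduced here. As a stand-in, the sketch below takes a different route: it builds an HKY85-style rate matrix and approximates P(t) = exp(Qt) numerically with a truncated matrix Taylor series, rather than using the diagonalised formulae. The layout of Q (rate to nucleotide j equal to alpha or beta multiplied by the base frequency of j) is assumed from the standard form of the model, and the names hky85_q, matmul and transition_matrix are hypothetical.

```python
def hky85_q(alpha, beta, pi):
    """Build an HKY85-style rate matrix Q (order T, C, A, G).

    alpha: transition rate, beta: transversion rate,
    pi: base frequencies (piT, piC, piA, piG), assumed to sum to 1.
    """
    transitions = {(0, 1), (1, 0), (2, 3), (3, 2)}  # T<->C and A<->G
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                rate = alpha if (i, j) in transitions else beta
                q[i][j] = rate * pi[j]
        q[i][i] = -sum(q[i])  # diagonal chosen so that each row sums to 0
    return q


def matmul(x, y):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_matrix(q, t, terms=30, squarings=10):
    """Approximate P(t) = exp(Q*t) with a truncated Taylor series.

    Q*t is first divided by 2**squarings so that the series converges
    quickly; the result is then squared that many times.
    """
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(4)] for i in range(4)]
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        term = matmul(term, a)
        term = [[v / n for v in row] for row in term]  # now equals a**n / n!
        result = [[result[i][j] + term[i][j] for j in range(4)]
                  for i in range(4)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


pi = (0.1, 0.2, 0.3, 0.4)  # illustrative base frequencies (T, C, A, G)
P = transition_matrix(hky85_q(2.0, 1.0, pi), 0.5)
for row in P:
    print(["%.4f" % p for p in row])
```

With equal base frequencies and alpha equal to beta, this construction reduces to the JC69 case, which gives a simple way of checking it against the earlier formulae.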

Parameters for time, transition rate, transversion rate and the base frequencies must be defined in order to generate the transition probability matrix. The function is then tested with experimental parameters, generating the matrix at the bottom of the image. At t = 0, the diagonal elements of P(t) are at 1 whilst all other values are at 0. This is because at t = 0, we would not expect any substitutions to have occurred to a nucleotide sequence. As time tends to infinity, the probabilities of the diagonal elements decrease, as all other elements increase, to their respective base frequencies. This would be the result of the nucleotides in the sequence reaching a stationary distribution: when the proportions of each nucleotide match their respective base frequencies. These proportions would be maintained, as further substitutions would continue to generate the same proportions of nucleotides. Therefore, in this case, the stationary distribution is also the limiting distribution.

The difference between the rates of substitution of transitions and transversions was well established and is reflected in most nucleotide models created after the K80 model. However, within transitions a further difference in rates can be distinguished. Nucleotides A and G are known as purines and nucleotides T and C are known as pyrimidines, the difference being the molecular structures of the nucleotides. Generally, purines and pyrimidines tend to have different rates of substitution; therefore, a more recent model than those discussed so far has been developed to accommodate this factor. In 1993, Tamura and Nei proposed a new model, which included parameters that would distinguish between the rates of pyrimidines and purines respectively. This model is commonly known as the TN93 model and introduces the parameters alpha1 and alpha2 in replacement of the single alpha parameter present in the HKY85 model for transitional rates. The rate matrix for this model is therefore very similar to that of the HKY85 model, as is the transition probability matrix:

Matrix P(t) =

Where:

Simulation of nucleotide sequences

The previously discussed models of nucleotide substitution all allow for the generation of probabilities that determine how a nucleotide sequence will evolve, or has evolved, based on likelihood. From this, a function can be used to simulate how a sequence of nucleotides may evolve based on these probabilities. For example, taking the principles of the simplest model, JC69, we can say that the probabilities for a nucleotide changing into one of the other nucleotides are equal. Therefore, when simulating a scheduled substitution of a nucleotide, because each transition probability is the same, the target nucleotide can be randomly chosen and the sequence mutated. If the transition probabilities were unequal, the target nucleotide would be randomly chosen but with incorporated bias favouring more probable transitions.

A function must be designed to first generate a random time at which a mutation will occur based on the total substitution rates of all the nucleotides of the sequence using the rate matrix Q. A time interval over which mutations will occur must be outlined; for simplicity, the interval from t = 0 to t = 1 is often used (timex). To begin mutation, a sequence of nucleotides must be provided; through the use of a function, a nucleotide sequence of any length can be generated (genseq). Using the timex function, a list of times is generated when a rate is input into the function. In this case, the total rate for all nucleotides of the sequence is input and a list of times is generated randomly; these times are used as the times of mutation. This technique cannot be used for more complex models of nucleotide evolution as they assume unequal transition probabilities and so, after a substitution, the total rate would change with the departure of one nucleotide and the creation of a new nucleotide.

In basing simulation on the JC69 model, the transition probability matrix for the JC69 model is used to generate the probabilities for mutations or for no changes. The genseq and timex functions are both used to generate a sequence of nucleotides and to then create a list of times at which mutations will take place. Please look to the functions section towards the end of this report for definitions of each function.

The following is a sequence of nucleotides before and after mutation using the JC69 transition probability matrix:

Before

After

Although 5 differences are visible from the initial sequence to the sequence after mutation, 7 actual mutations had occurred, with two of the mutations acting on the same starting nucleotide, the 8th, and the second of these returning the 8th nucleotide back to its starting state (nucleotide C). 7 mutations were achieved using the timex function and inputting a value of 4.5 for rate (at).

Simulation of mutation using the K80 model requires a slightly different method, as does simulation using the HKY85 and TN93 models, due to the differing principles and parameters between each model. These principles are quite easily summarised:

K80 - as transitions and transversions occur at different rates, they must be distinguished, and the function written for simulating mutation under the probabilities generated by the K80 model accounts for this. This results in transition mutations and transversion mutations occurring at different rates in the nucleotide sequence being mutated.

HKY85 - as the HKY85 model utilises several different parameters, and therefore rates, to distinguish probabilities, the function written to simulate under the principles of this model uses multiple rates when conducting a mutation. Also, as each nucleotide is subject to different rates of mutation, the total rate by which any mutation will occur using the timex function is updated after any nucleotide is mutated and changed into another, to account for this change.

TN93 - the function simulating mutation under the principles of the TN93 model acts in the same way as the function used for the HKY85 model. The only difference is that the TN93 model introduces an additional rate, breaking the rate for alpha (transitions) into alpha1 (transitions between pyrimidines) and alpha2 (transitions between purines).
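The principle summarised above for the HKY85 model, where the total rate is recalculated after every mutation, can be sketched as follows. This is not the function from the appendix (which is not reproduced here) but a hypothetical illustration: the names leaving_rates and simulate, the rate layout (alpha or beta multiplied by the base frequency of the target) and the example parameter values are all assumptions.

```python
import random

BASES = "TCAG"
TRANSITIONS = {("T", "C"), ("C", "T"), ("A", "G"), ("G", "A")}


def leaving_rates(base, alpha, beta, pi):
    """Rates from `base` to each other base under an HKY85-style Q."""
    rates = {}
    for j, target in enumerate(BASES):
        if target != base:
            k = alpha if (base, target) in TRANSITIONS else beta
            rates[target] = k * pi[j]
    return rates


def simulate(seq, alpha, beta, pi, t_end=1.0, rng=random):
    """Mutate `seq` over [0, t_end], updating the total rate after each event."""
    seq = list(seq)
    site_rates = [sum(leaving_rates(b, alpha, beta, pi).values()) for b in seq]
    t, events = 0.0, 0
    while True:
        total = sum(site_rates)
        if total <= 0:
            break                              # nothing can mutate
        t += rng.expovariate(total)            # waiting time to next mutation
        if t > t_end:
            break
        # Choose the site in proportion to its current leaving rate.
        site = rng.choices(range(len(seq)), weights=site_rates)[0]
        options = leaving_rates(seq[site], alpha, beta, pi)
        targets = list(options)
        # Choose the target with bias towards the more probable substitutions.
        seq[site] = rng.choices(targets, weights=[options[x] for x in targets])[0]
        # The mutated site now has a different leaving rate.
        site_rates[site] = sum(leaving_rates(seq[site], alpha, beta, pi).values())
        events += 1
    return "".join(seq), events


start = "TCAG" * 5                             # made-up starting sequence
end, events = simulate(start, 2.0, 1.0, (0.1, 0.2, 0.3, 0.4))
print(start)
print(end, events)
```

A TN93 version would follow the same pattern, with the single alpha replaced by alpha1 for T<->C changes and alpha2 for A<->G changes inside leaving_rates.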

The functions written for the simulation of the mutation of a nucleotide sequence are included in the appendix and are labelled accordingly.

Maximum Likelihood Estimates (MLE) - JC69 & K80 Models

Maximum likelihood estimates are used to estimate parameter values for a statistical model when applying that model to a data set. In the case of nucleotide substitutions, the statistical models fitted to data are the models of nucleotide substitution and the parameter estimated is the value for rate and time. Rate and time are dealt with as a single value as they cannot be distinguished from one another; the single value (at) can be produced by the product of a number of different combinations of values of either alpha or time. The data set used will be two sequences of nucleotides of equal length, of which one sequence will be assumed to have evolved from the other through several mutations. The total length of a sequence is represented by the letter n and the number of differences (nucleotides which differ between the sequences) is represented by the letter k.

JC69

To explain the theory behind acquiring the maximum likelihood estimate, the binomial distribution must be considered. The following is the probability mass function (pmf) of the binomial distribution:

n = the total length of a sequence.
k = the number of differences between the two sequences.

The probability mass function is used to calculate the probability of the observed data when a variable (at) is exactly equal to the value proposed for the variable. For example, if a value for at is input into the probability mass function, the value calculated will represent the probability of observing k differences given that value of at. In replacement of the variable p is the equation used in the transition probability matrix for the JC69 model to calculate the probability of a mutation occurring. The equation used in replacement of 1 - p is the equation from the transition probability matrix of the JC69 model used to calculate the probability of a mutation not occurring. The following equation is the probability mass function, altered to include the variables mentioned above with the total length of a sequence (n) as 100 and the number of differences (k) as 40. The notation pow(x, y) represents the value x to the power of y:

Probability mass function =

The variable m represents the value at. The value of at with the highest probability can be found through trial and error; however, using Python, all values of at within an interval can be tested and plotted onto a graph:
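A search of this kind might be sketched as below. The exact equation from the original image is not reproduced, so this version makes an assumption: p is taken as the probability that a site differs in any of the three possible ways (3/4 - 3/4*exp(-4*m)), which makes the function a proper binomial. Using the single-substitution probability instead would only rescale the curve by a constant and would not move the peak. The function name jc69_likelihood and the grid spacing are illustrative.

```python
from math import comb, exp, log


def jc69_likelihood(m, n, k):
    """Binomial probability of k differing sites out of n when at = m."""
    p_diff = 0.75 - 0.75 * exp(-4.0 * m)   # any of the three possible changes
    return comb(n, k) * pow(p_diff, k) * pow(1.0 - p_diff, n - k)


n, k = 100, 40
grid = [i / 1000.0 for i in range(1, 401)]  # at from 0.001 to 0.400
best = max(grid, key=lambda m: jc69_likelihood(m, n, k))

# Setting p_diff equal to k/n and solving for m gives a closed-form check.
closed_form = -0.25 * log(1.0 - 4.0 * k / (3.0 * n))

print("grid estimate:", best)
print("closed form:  %.4f" % closed_form)
```

Both approaches should agree to within the spacing of the grid, which offers a quick check on the value read from a plotted curve.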

The probability mass function equation displayed above was used to generate the data to plot this graph. The values of m (at) within the interval 0 to 0.4 were tested and a peak probability was acquired. The peak represents the value of m (at) with the highest probability of resulting in the value of k and therefore is the maximum likelihood estimate. In this case, the maximum likelihood estimate is 0.19 for at.

K80

Finding the maximum likelihood estimate using the principles of the K80 model is approached in a very similar way to the JC69 model. The probability mass function is adjusted so that two values are estimated, as there are two parameters for rates in the K80 model, alpha and beta.

Pmf = p0^(n-k-j) * p1^k * p2^j

Where:
p0 = the equation used from the transition probability matrix of the K80 model to calculate the probability of no mutation occurring.
p1 = the equation used from the transition probability matrix of the K80 model to calculate the probability of a transition mutation occurring.
p2 = the equation used from the transition probability matrix of the K80 model to calculate the probability of a transversion mutation occurring.
n = the total length of a sequence.
k = the number of differences between two sequences that have resulted from transition mutations.
j = the number of differences between two sequences that have resulted from transversion mutations.

Probability mass function =

a and b represent the values for the rates of transitions (alpha) and transversions (beta) respectively. Using Python, a table can be generated showing the probabilities of a value of a being most likely when b is of another value. The values in this table can be plotted graphically using a contour plot. The following is a contour plot generated using the equation for the probability mass function displayed above; however, the total length of a sequence (n) is 100, the number of differences that have resulted from transition mutations (k) is 30 and the number of differences between two sequences that have resulted from transversion mutations (j) is 10:

The lines become concentrated around the maximum likelihood estimates for the values of alpha (rate of transitions) and beta (rate of transversions). The estimate for the most probable value of b is clearly centred on the interval between 0.12 and 0.14. Unfortunately, the value for a is not visible as the limits of this contour graph do not show where the lines of the graph centre on the y-axis.

Maximum likelihood estimates are used in conjunction with models of nucleotide evolution mainly to estimate the time taken for one sequence of nucleotides to evolve into another, assuming that one sequence is the ancestor of the other. Although only a value for at, the product of both rate and time, is directly obtainable, if an average rate (or rates in the case of multiple-parameter models) is known, the variable of time can be separated out using the known value for rate, and so the time taken for one sequence to mutate into the other can be calculated. Practically, biologists and statisticians have adopted this method when attempting to calculate the time taken for particular species (such as humans) to have evolved from ancestral species (such as lesser evolved primates). By assessing the same sections of DNA, of the same length, from the two species, the number of differences may be recorded and used to estimate a time using the maximum likelihood method.

Conclusions

As my investigation was not an experiment as such but rather the translation of statistical models onto software so as to use these models in practical situations, my conclusion would be to state that the programmes that I have written to emulate these statistical models have been successful and so may be applied to practical data sets. This translational link between statistical models and new computing software embodies the basic principles of bioinformatics and allows demonstrations of how statisticians and biologists can therefore use these models when dealing with mutated sequences of DNA.

If I had further research time and possibly slightly more options in terms of computing software, there are multiple areas that I would have expanded within my project and report. First of all, I would have included a step-by-step explanation of the Taylor series expansion, allowing readers to understand the mathematical theory behind obtaining the transition probability matrix from the rate matrix of a nucleotide model. Also, I would have explored further models of nucleotide evolution, as there are many more significant models that have not been mentioned. These models would have broadened the scope of my project and would have depicted further steps by which each model was chronologically improved. I believe that the last section of this report, the maximum likelihood estimation of the JC69 and K80 models, could be progressed further. With access to alternative computing software that could plot multi-dimensional graphs, I would have extended the calculation of maximum likelihood estimates into estimating the parameters for the HKY85 and TN93 models.

References:

Computational Molecular Evolution (Yang 2006)
www.wikipedia.org
www.python.org
http://docs.python.org/lib/module-random.html
http://www.tau.ac.il/~doronadi/F81_model.doc
http://www.megasoftware.net/WebHelp/part_iv___evolutionary_analysis/computing_evolutionary_distances/distance_models/nucleotide_substitution_models/hc_jukes_cantor_distance.htm
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=hmg.figgrp.1080
Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach (Joseph Felsenstein 1981)
A Novel Use of Equilibrium Frequencies in Models of Sequence Evolution (Nick Goldman and Simon Whelan)

Functions

Genseq - the generation of a random sequence of nucleotides is essential to the simulation of nucleotide substitution. To define a function to generate a sequence, a parameter for the length of the sequence must be defined. In this case, n is used. The function randomly chooses a letter, representing each nucleotide, from the list “ACGT” using the in-built ‘randint’ function. The chosen letter is added to a list; the process of choosing a letter is then repeated n times, creating a list or ‘sequence’ n nucleotides long.
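A minimal sketch of a function matching this description is given below. The original listing is not reproduced here, so the exact body is an assumption; only the name genseq, the parameter n, the list “ACGT” and the use of randint come from the description above.

```python
import random


def genseq(n):
    """Return a list of n randomly chosen nucleotides."""
    letters = "ACGT"
    sequence = []
    for _ in range(n):
        # randint is inclusive at both ends, so 0..3 indexes the four letters.
        sequence.append(letters[random.randint(0, 3)])
    return sequence


seq = genseq(20)
print("".join(seq))
```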

Timex - this function allows for the generation of a cumulative set of times that represent when mutations will occur to a nucleotide sequence. This function is only used within the simpler models of substitution as it assumes that transition probabilities are the same for each nucleotide. An in-built function (random.expovariate) takes a value for rate as a parameter and generates another value using this rate value. Inputting a higher rate value will increase the probability of the in-built function generating a smaller value. Values are generated using the same rate value and are accumulated to represent the times at which events occur according to the input rate value. This process is terminated when the cumulative time value increases over 1, as we are only interested in mutations occurring between times 0 and 1. This function is effective as, theoretically, if events occur at a higher rate, more events will occur in a given time.
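The behaviour described for timex might be sketched as follows. The body is an assumption based on the description above (the original listing is not reproduced); the optional t_end parameter is a hypothetical addition so that the cut-off of 1 is explicit.

```python
import random


def timex(rate, t_end=1.0):
    """Return the cumulative event times that fall within [0, t_end]."""
    times = []
    t = random.expovariate(rate)       # first waiting time
    while t <= t_end:
        times.append(t)
        t += random.expovariate(rate)  # add the next waiting time
    return times


# Rate of 4.5, as used for the simulated example earlier in the report.
times = timex(4.5)
print(len(times), ["%.3f" % x for x in times])
```

The length of the returned list gives the number of mutations that occur in the interval, which on average is close to the rate multiplied by t_end.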

Intgen - this function was created to generate a list, of length n, of random numbers. These random numbers denote at what points mutations will occur. The timex function is initially used to calculate the number of mutations that will occur in an allotted time. The number of calculated mutations will then signify the length of the list of random numbers. Each number within this list refers to the nth nucleotide of a sequence being mutated. That nucleotide will then be mutated.
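The following sketch illustrates how a list of positions of this kind might be generated and then used to mutate a sequence under the JC69 assumption of equally likely targets. It is hypothetical: the signature of intgen, the helper mutate_jc69 and the starting sequence are all assumptions, positions are 0-based for convenience, and the number of mutations is passed in directly rather than being taken from timex so that the example stands alone.

```python
import random


def intgen(count, seq_length):
    """Return `count` random positions (0-based) within a sequence."""
    return [random.randint(0, seq_length - 1) for _ in range(count)]


def mutate_jc69(seq, positions):
    """Apply one JC69-style substitution at each listed position, in order."""
    seq = list(seq)
    for pos in positions:
        others = [b for b in "TCAG" if b != seq[pos]]
        seq[pos] = random.choice(others)   # all three targets equally likely
    return seq


before = list("TCAGTCAGTCAGTCAGTCAG")      # made-up starting sequence
positions = intgen(7, len(before))         # 7 mutations, as in the example run
after = mutate_jc69(before, positions)
print("".join(before))
print("".join(after))
```

Because the same position can be drawn more than once, the number of visible differences between the two sequences can be smaller than the number of mutations applied.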

Appendix:

K80 - Function for simulation of mutation of nucleotide sequence

HKY85 - Function for simulation of mutation of nucleotide sequence

TN93 - Function for simulation of mutation of nucleotide sequence