basic models of nucleotide evolution report
Basic Models of Nucleotide Evolution

Over time, nucleotides within a sequence can ‘evolve’ through substitution. This process can cause a nucleotide (T, C, A or G) to change into another nucleotide and is the main driving force behind evolution. For example, the nucleotide A in a sequence of DNA can change over time into the nucleotide C. This change may result in the sequence becoming inactive if it was previously involved in protein synthesis as an exon, or may change the protein that the sequence codes for. As proteins are the building blocks of organic life, this may cause large changes in an organism’s features. Alternatively, this change may have no effect at all. On average, this form of mutation only occurs once or twice every million years. However, in assessing the evolution of species over hundreds of millions of years, models are useful in evaluating how one sequence of nucleotides may have evolved from another.

Models of nucleotide evolution can be used when examining two sequences of DNA of the same length that may be related. This type of model would be used to compare the two sequences by either assuming that one sequence evolved into the other or vice-versa, or assuming that they had evolved from a common ‘ancestral’ sequence of DNA. Applying the model would give the estimated number of nucleotide substitutions per site, called the distance, which would then be used to estimate a time. This time could then relate to when one sequence evolved from the other, or to how long ago an ‘ancestral’ sequence of DNA diverged into each sequence.

In this paper, I will outline the principles and theory behind the main (most commonly used) models of nucleotide substitution, addressing each model chronologically and in some senses with increasing complexity. The models are as follows:
o Jukes and Cantor 1969 (JC69)
o Kimura 1980 (K80)
o Felsenstein 1981 (F81)
o Hasegawa, Kishino and Yano 1985 (HKY85)
o Tamura and Nei 1993 (TN93)
I will demonstrate how programming software may be used to process data using the formulae proposed within each model. From this I will explain how, continuing to use programming software, each model is capable of simulating the evolution of a nucleotide sequence over a given time.

JC69 Model

In terms of creating models that assess nucleotide substitution, the rate of substitution from one nucleotide to another and the time over which substitution has been allowed to act are key variables. Different models organise their use of rates in different ways but time is always used in the same way. The simplest model of nucleotide substitution is the Jukes and Cantor 1969 (JC69) model. This model assumes that the rate of substitution is the same between all nucleotides. Therefore, this model only requires a single parameter denoting rate, along with a value for time. A 4x4 matrix can be created showing the rates of nucleotide substitution between the 4 nucleotides. This is known as matrix Q:
Q =
         T   C   A   G
    T  [ .   m   m   m ]
    C  [ m   .   m   m ]
    A  [ m   m   .   m ]
    G  [ m   m   m   . ]
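Along the diagonal of this matrix, the rates of nucleotides changing into themselves are not displayed, as they are not regarded as substitutions; the diagonal elements are instead fixed by the requirement that each row sums to 0. Using the rates in matrix Q, we can work out the probability of each nucleotide substitution occurring when t > 0, creating another matrix. This matrix is known as the transition probability matrix (P(t)) and is also a 4x4 matrix:

P(t) =
    X  Y  Y  Y
    Y  X  Y  Y
    Y  Y  X  Y
    Y  Y  Y  X

Where:
X = 1/4.0 + 3/4.0*exp(-4*m*t)
Y = 1/4.0 - 1/4.0*exp(-4*m*t)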
These formulae calculate the probability of one nucleotide evolving into another. They are obtained through the exponentiation of the matrix Q using the matrix Taylor series. In terms of using the matrix P(t) with real-world or experimental data, a program can be written which will calculate the transition probabilities of each nucleotide substitution using the formulae in P(t). Python is a basic but effective programming language, which can be used in these circumstances. We must first define a function that will implement the formulae of the matrix P(t) when given certain values to work from. These values are called parameters and, in the case of working out the transition probabilities, we must input a value for the rate at which nucleotide substitutions will occur as well as a value for the time over which substitutions will occur.
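The Taylor series exponentiation mentioned above can be checked numerically. The following is a minimal sketch, not the derivation used in this report: it assumes the JC69 rate matrix has the rate m in every off-diagonal position and -3m on the diagonal, and the helper names mat_mul and mat_exp_taylor are illustrative only. It sums a truncated series I + Qt + (Qt)^2/2! + ... and compares the result with the standard closed-form JC69 probabilities.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp_taylor(q, t, terms=30):
    """Approximate exp(Q*t) by the truncated series I + Qt + (Qt)^2/2! + ..."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # current term, starts as I
    for k in range(1, terms):
        term = mat_mul(term, qt)               # multiply by Qt ...
        term = [[x / k for x in row] for row in term]   # ... and divide by k
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

m, t = 0.2, 1.0
# Assumed JC69 rate matrix: m off the diagonal, -3m on it (rows sum to 0).
Q = [[-3 * m if i == j else m for j in range(4)] for i in range(4)]
P = mat_exp_taylor(Q, t)

closed_same = 0.25 + 0.75 * math.exp(-4 * m * t)
closed_diff = 0.25 - 0.25 * math.exp(-4 * m * t)
print(P[0][0], closed_same)
print(P[0][1], closed_diff)
```

For small values of rate multiplied by time, as here, a few dozen terms are more than enough for the series to agree with the closed-form values.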
The following code, written in Python, emulates the matrix P(t):

[Image: Python listing of the JC69 function and its test output]
The function ‘JC69’ used to calculate the transition probabilities is tested by inputting an experimental rate (0.2) and time (1). The result is a matrix displaying the probabilities row by row with nucleotide order T, C, A and G, in the same orientation as the matrix Q. From the formulae used to calculate the transition probabilities, conclusions can be drawn about how increasing the rate or time will affect the resultant probabilities.
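A function of the kind described can be sketched as follows. This is a reconstruction under stated assumptions rather than the original listing: the name jc69, its argument order (rate m, then time t) and the printed format are illustrative, and the formulae are the standard JC69 transition probabilities.

```python
import math

def jc69(m, t):
    """Return the 4x4 JC69 transition probability matrix (order T, C, A, G)."""
    e = math.exp(-4.0 * m * t)
    same = 0.25 + 0.75 * e      # probability of ending in the starting state
    diff = 0.25 - 0.25 * e      # probability of ending in one specific other state
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

# Test with the experimental rate (0.2) and time (1) used in the text.
P = jc69(0.2, 1)
for row in P:
    print(" ".join(f"{p:.4f}" for p in row))
```

Each row of the result sums to 1, since a nucleotide must end in one of the four states.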
The exponential (exp) of a negative value gives a number between 0 and 1. As the negative value increases in size, its exponential becomes smaller, tending to 0 as the exponent tends to negative infinity. In the formulae X and Y above, as the values of m (rate) and t (time) increase, the values being added to ¼ in X and subtracted from ¼ in Y become vanishingly small. This results in the transition probabilities tending towards ¼ for each nucleotide substitution. This supports the assumption that, over an increased time or rate, so many nucleotide substitutions would have occurred that the target nucleotide is eventually random, with a probability of ¼ for each nucleotide.
This is demonstrated in the following graph, taking increasing values for rate with a constant time of 1:

[Image: graph of Pii(t) and Pij(t) against rate at t = 1]
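The same trend can be seen without a plot by tabulating the two probabilities. This is a small sketch assuming the standard JC69 formulae; the function name jc69_probs and the particular rate values chosen are illustrative.

```python
import math

def jc69_probs(m, t):
    """Return (Pii, Pij) under JC69 for rate m and time t."""
    e = math.exp(-4.0 * m * t)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e

# Increasing rate with time held constant at 1.
rows = [(m, *jc69_probs(m, 1.0)) for m in (0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0)]
for m, p_same, p_diff in rows:
    print(f"m={m:<4} Pii={p_same:.4f} Pij={p_diff:.4f}")
```

Pii falls from 1 and Pij rises from 0, with both approaching ¼ as the rate grows.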
Pii(t) represents the probability that a nucleotide is in the same state after a period of time (t), whether through no substitution or through substitutions that return it to its original state. Pij(t) represents the probability that a nucleotide will have evolved into another nucleotide over a period of time. As time tends to infinity, the proportion of nucleotides of each type (T, C, A, G) in a sequence that has been allowed to evolve reaches ¼ for each. This distribution of nucleotides is called the limiting distribution and, as the rates of change are the same for all nucleotides in the JC69 model, this proportion will be maintained. This proportional equilibrium is called the stationary distribution.

K80 Model

Kimura proposed a model with a more complex mix of rates between nucleotide substitutions in 1980. This model is commonly known as the K80 model and uses two rates as parameters along with time. Nucleotide substitutions can be classified as one of two types: transitions and transversions. Transitions are substitutions between nucleotides of similar molecular structure, that is, between purines or between pyrimidines, and tend to occur more frequently than other substitutions. Nucleotides A and G are purines and are substituted for each other at a higher rate, as are nucleotides T and C, which are pyrimidines. All other substitutions are transversions and are known to occur less frequently than transitions. In 1980, the first mitochondrial sequences were published showing a definitive difference between the frequencies of transitions and transversions, transitions being noticeably higher. As a result, the K80 model was developed and implemented by Kimura in response to these findings.
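The rate matrix (Q) in the K80 model displays two rates: alpha (representing the substitution rates of the transitions) and beta (representing the substitution rates of the transversions). In the following representation of the matrix Q, alpha = K and beta = 1:

Q =
         T   C   A   G
    T  [ .   K   1   1 ]
    C  [ K   .   1   1 ]
    A  [ 1   1   .   K ]
    G  [ 1   1   K   . ]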
As with the rate matrix for the JC69 model, the diagonal elements of the matrix Q are not included, as these are not regarded as substitutions. The total substitution rate for any nucleotide would be a + 2b (K + 1 + 1). Deriving the transition probability matrix from the matrix Q is slightly more difficult than for the JC69 model. The transition probability matrix (P(t)) is as follows:

P(t) =
    p0(t)  p1(t)  p2(t)  p2(t)
    p1(t)  p0(t)  p2(t)  p2(t)
    p2(t)  p2(t)  p0(t)  p1(t)
    p2(t)  p2(t)  p1(t)  p0(t)

Where:
p0(t) = 1/4.0 + 1/4.0*exp(-4*b*t) + 1/2.0*exp(-2*(a+b)*t)
p1(t) = 1/4.0 + 1/4.0*exp(-4*b*t) - 1/2.0*exp(-2*(a+b)*t)
p2(t) = 1/4.0 - 1/4.0*exp(-4*b*t)

As with the JC69 model, we can also create a program that will emulate the transition probability matrix with relative ease by inputting the parameter values for alpha (a), beta (b) and time (t). Organising the formulae of the transition probability matrix in a similar way to the JC69 model using Python defines the following function:
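A minimal sketch of such a function is given here, using the three formulae p0(t), p1(t) and p2(t) exactly as stated above. The function name k80 and its argument order are assumptions made for illustration; the original listing is not reproduced.

```python
import math

def k80(a, b, t):
    """Return the 4x4 K80 transition probability matrix (order T, C, A, G)."""
    e1 = math.exp(-4.0 * b * t)
    e2 = math.exp(-2.0 * (a + b) * t)
    p0 = 0.25 + 0.25 * e1 + 0.5 * e2   # same nucleotide
    p1 = 0.25 + 0.25 * e1 - 0.5 * e2   # transition (T<->C, A<->G)
    p2 = 0.25 - 0.25 * e1              # one specific transversion
    return [
        [p0, p1, p2, p2],
        [p1, p0, p2, p2],
        [p2, p2, p0, p1],
        [p2, p2, p1, p0],
    ]

# Test with the parameters used in the text: a = 0.4, b = 0.2, t = 1.
P = k80(0.4, 0.2, 1)
for row in P:
    print(" ".join(f"{p:.4f}" for p in row))
```

Because each row contains p0, p1 and two copies of p2, the rows sum to 1 for any choice of a, b and t.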
The function is tested using the parameters a = 0.4, b = 0.2, t = 1. The transition probabilities for nucleotides experiencing no substitutions after t = 1 are high, whereas the transition probabilities for transitions and transversions are relatively low in comparison. When considering the formulae used to calculate these probabilities, certain inevitable trends are recognisable:
x (p0(t) above) represents the probability of a nucleotide experiencing no change over a given time. When t = 0, x = 1; from this point, x decreases exponentially to the value of ¼. y (p1(t) above) represents the probability of a nucleotide experiencing a transition (A <-> G or T <-> C) over a given time. At t = 0, the value of y is 0; when no time has passed, the probability of a given nucleotide experiencing any sort of substitution is 0. This is also true for transversional substitutions, represented by equation z (p2(t) above). As time increases from 0, the transition probabilities for both transversions and transitions increase, tending towards ¼ in the limit. As the rates of transitional change are higher than those of transversional change, the transition probabilities for transitional substitutions increase at a higher rate; when a is greater than b they rise slightly above ¼ before settling back to ¼. The following graph represents the changes in the transition probabilities of transitions, transversions and no substitution sites as time increases:
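The curves described here can also be tabulated directly from the K80 formulae. The sketch below assumes alpha = 0.4 and beta = 0.2 and a time grid from 0 to 10 in steps of 0.1; the function name k80_probs is illustrative.

```python
import math

def k80_probs(a, b, t):
    """Return (no change, transition, transversion) probabilities under K80."""
    e1 = math.exp(-4.0 * b * t)
    e2 = math.exp(-2.0 * (a + b) * t)
    return (0.25 + 0.25 * e1 + 0.5 * e2,
            0.25 + 0.25 * e1 - 0.5 * e2,
            0.25 - 0.25 * e1)

times = [i / 10.0 for i in range(0, 101)]       # 0.0, 0.1, ..., 10.0
curves = [k80_probs(0.4, 0.2, t) for t in times]

# Print every tenth point to keep the table short.
for t, (x, y, z) in list(zip(times, curves))[::10]:
    print(f"t={t:4.1f}  x={x:.4f}  y={y:.4f}  z={z:.4f}")
```

With these rates the transition curve y climbs faster than the transversion curve z and passes slightly above ¼ before all three curves settle at ¼.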
To create this graph, the values of alpha and beta were set to 0.4 and 0.2 respectively. These values simulate realistic values for the rates of transitions and transversions, as observed rates have shown that transitional substitutions occur at a higher frequency than transversional substitutions. Time ranges from 0 to 10 in steps of 0.1.

HKY85 and TN93 Models

Hasegawa, Kishino and Yano developed a model in 1985 that combined elements of both the K80 and F81 models. This is known as the HKY85 model and incorporates multiple parameters to create a more realistic simulation of how nucleotide sequences behave. First of all, the HKY85 model assumes that the rates of substitution differ between each nucleotide. A single value defines the rate at which any nucleotide is substituted by a given target nucleotide. For example, a value for T would define the rates by which any nucleotide would be substituted to result in the creation of the nucleotide T. These values are known as base frequencies and, within this model, the base frequencies are deemed unequal. Further parameters are included to distinguish between the rates of transitions and transversions, as within the K80 model. After the first mitochondrial sequences were published in 1980, the difference between the rates of transitions and transversions was made definitive, and so most nucleotide evolution models created after 1980 incorporate parameters that define the rates of transitions and transversions separately. The HKY85 model is seen to give a more accurate representation of nucleotide substitutions in comparison to the JC69, K80 and F81 models by accommodating multiple factors. The rate matrix Q is as follows:

Q =
          T      C      A      G
    T  [  .     KπC     πA     πG  ]
    C  [ KπT     .      πA     πG  ]
    A  [  πT     πC     .     KπG  ]
    G  [  πT     πC    KπA     .   ]
The matrix is organised as the rate matrices for all previous models have been: the columns and rows are in the nucleotide order T, C, A, G respectively. Within this representation of the matrix Q, K marks the transitional substitutions. All other substitutions are assumed to be transversional, other than the diagonal values of the matrix, which are not substitutions. πT represents the rate of substitutions resulting in the formation of the nucleotide T, as mentioned before. πC represents the rate of substitutions resulting in the formation of the nucleotide C, and so on.

Deriving the transition probability matrix (P(t)) is not as simple as with the previous models, because the matrix Q is not a diagonal matrix and cannot be exponentiated element by element. Therefore, the matrix Q is initially diagonalized, followed by the exponentiation of the diagonal matrix to produce the matrix P(t):

[Image: the HKY85 transition probability matrix P(t) and the definitions of its terms]
Most of the transition probabilities differ for each substitution within this model; this more closely emulates how nucleotides would behave in real life in comparison to the previous models. More factors are taken into account to achieve this, and so the formulae increase in complexity as they accommodate a larger number of variables. Writing a function to carry out the formulae in the transition probability matrix is slightly more time-consuming than for previous models but it is still achievable:
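The closed-form HKY85 expressions are not reproduced here, so the following sketch takes a different route and should be read as an illustration only: it builds the rate matrix from a transition rate a, a transversion rate b and base frequencies pi (order T, C, A, G), then exponentiates it numerically with a truncated Taylor series combined with repeated squaring. All function names (mat_mul, mat_exp, hky85_q, hky85) and the example base frequencies are assumptions, not the original code.

```python
def mat_mul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, squarings=10, terms=20):
    """exp(Q*t): Taylor series on Q*t/2**squarings, then square the result."""
    n = len(q)
    scale = t / (2 ** squarings)
    step = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def hky85_q(a, b, pi):
    """HKY85 rate matrix: a*pi_j for transitions, b*pi_j for transversions."""
    transitions = {(0, 1), (1, 0), (2, 3), (3, 2)}   # T<->C and A<->G
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = (a if (i, j) in transitions else b) * pi[j]
        q[i][i] = -sum(q[i])          # diagonal makes the row sum to 0
    return q

def hky85(a, b, pi, t):
    """Transition probability matrix P(t) under HKY85."""
    return mat_exp(hky85_q(a, b, pi), t)

pi = (0.1, 0.2, 0.3, 0.4)             # hypothetical base frequencies
P = hky85(0.4, 0.2, pi, 1.0)
for row in P:
    print(" ".join(f"{p:.4f}" for p in row))
```

Evaluating the same function at t = 0 returns the identity matrix, and at very large t every row approaches the base frequencies.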
Parameters for time, transition rate, transversion rate and the base frequencies must be defined in order to generate the transition probability matrix. The function is then tested with experimental parameters, generating a transition probability matrix. At t = 0, the diagonal elements of P(t) are at 1 whilst all other values are at 0. This is because at t = 0, we would not expect any substitutions to have occurred to a nucleotide sequence. As time tends to infinity, the probabilities of the diagonal elements decrease, as all other elements increase, to their respective base frequencies. This would be the result of the nucleotides in the sequence reaching a stationary distribution: when the proportions of each nucleotide match their respective base frequencies. These proportions would be maintained, as further substitutions would continue to generate the same proportions of nucleotides. Therefore, in this case, the stationary distribution is also the limiting distribution.

The difference between the rates of substitution of transitions and transversions was well established and is reflected in most nucleotide models created after the K80 model. However, within transitions a further difference in rates can be distinguished. Nucleotides A and G are known as purines and nucleotides T and C are known as pyrimidines, the difference being the molecular structures of the nucleotides. Generally, transitions between purines and transitions between pyrimidines tend to occur at different rates; therefore, a more recent model than those discussed so far has been developed to accommodate this factor. In 1993, Tamura and Nei proposed a new model, which included parameters that would distinguish between the rates of pyrimidine and purine transitions respectively. This model is commonly known as the TN93 model and introduces the parameters alpha1 and alpha2 in replacement of the single alpha parameter present in the HKY85 model for transitional rates. The rate matrix for this model is therefore very similar to that of the HKY85 model, as is the transition probability matrix:
[Image: the TN93 transition probability matrix P(t) and the definitions of its terms]
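To make the relationship between the two models concrete, the TN93 rate matrix can be sketched in a few lines. This is an illustrative construction under the usual convention that each rate is a rate parameter multiplied by the base frequency of the target nucleotide; the function name tn93_q and the example values are assumptions.

```python
def tn93_q(alpha1, alpha2, beta, pi):
    """TN93 rate matrix, order T, C, A, G; pi holds the base frequencies."""
    def rate(i, j):
        if {i, j} == {0, 1}:
            return alpha1             # pyrimidine transition, T<->C
        if {i, j} == {2, 3}:
            return alpha2             # purine transition, A<->G
        return beta                   # any transversion
    q = [[rate(i, j) * pi[j] if i != j else 0.0 for j in range(4)]
         for i in range(4)]
    for i in range(4):
        q[i][i] = -sum(q[i])          # diagonal makes the row sum to 0
    return q

pi = (0.1, 0.2, 0.3, 0.4)             # hypothetical base frequencies
Q = tn93_q(0.5, 0.3, 0.2, pi)
for row in Q:
    print(" ".join(f"{v:7.3f}" for v in row))

# Setting alpha1 equal to alpha2 gives a matrix of the HKY85 form.
Q_hky = tn93_q(0.4, 0.4, 0.2, pi)
```

The base frequencies are stationary for this matrix: multiplying the row vector pi by Q gives a vector of zeros, which is why the long-run proportions match the base frequencies.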
Simulation of nucleotide sequences

The previously discussed models of nucleotide substitution all allow for the generation of probabilities that determine how a nucleotide sequence will evolve, or has evolved, based on likelihood. From this, a function can be used to simulate how a sequence of nucleotides may evolve based on these probabilities. For example, taking the principles of the simplest model, JC69, we can say that the probabilities for a nucleotide changing into one of the other nucleotides are equal. Therefore, when simulating a scheduled substitution of a nucleotide, because each transition probability is the same, the target nucleotide can be randomly chosen and the sequence mutated. If the transition probabilities were unequal, the target nucleotide would be randomly chosen but with incorporated bias favouring more probable transitions.

A function must be designed to first generate a random time at which a mutation will occur, based on the total substitution rates of all the nucleotides of the sequence using the rate matrix Q. A time interval over which mutations will occur must be outlined; for simplicity, the interval from t = 0 to t = 1 is often used (timex). To begin mutation, a sequence of nucleotides must be provided; through the use of a function, a nucleotide sequence of any length can be generated (genseq). Using the timex function, a list of times is generated when a rate is inputted into the function. In this case, the total rate for all nucleotides of the sequence is inputted and a list of times is generated randomly; these times are used as the times of mutation. This technique cannot be used for more complex models of nucleotide evolution, as they assume unequal transition probabilities and so, after a substitution, the total rate would change with the departure of one nucleotide and the creation of a new nucleotide.

When basing the simulation on the JC69 model, the transition probability matrix for the JC69 model is used to generate the probabilities for mutations or for no changes. The genseq and timex functions are both used to generate a sequence of nucleotides and to then create a list of times at which mutations will take place. Please look to the functions section towards the end of this report for definitions of each function.

The following is a sequence of nucleotides before and after mutation using the JC69 transition probability matrix:

[Image: the nucleotide sequence before mutation and after mutation]
Although 5 differences are visible from the initial sequence to the sequence after mutation, 7 actual mutations had occurred, with two of the mutations acting on the same starting nucleotide, the 8th, and the second mutation returning the 8th nucleotide back to its starting state (nucleotide C). The 7 mutations were obtained using the timex function and inputting a value of 4.5 for rate (at).

Simulation of mutation using the K80 model requires a slightly different method, as does simulation using the HKY85 and TN93 models, due to the differing principles and parameters between each model. These principles can be summarised quite easily:
K80 - as transitions and transversions occur at different rates, they must be distinguished from one another, and the function written for simulating mutation under the probabilities generated by the K80 model accounts for this. This results in transition mutations and transversion mutations occurring at different rates in the nucleotide sequence being mutated.

HKY85 - as the HKY85 model utilises several different parameters, and therefore rates, to distinguish probabilities, the function written to simulate under the principles of this model uses multiple rates when conducting a mutation. Also, as each nucleotide is subject to different rates of mutation, the total rate by which any mutation will occur using the timex function is updated after any nucleotide is mutated and changed into another, to account for this change.

TN93 - the function simulating mutation under the principles of the TN93 model acts in the same way as the function used for the HKY85 model. The only difference is that the TN93 model introduces an additional rate, breaking the rate for alpha (transitions) into alpha1 (transitions between pyrimidines) and alpha2 (transitions between purines).
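The idea of updating the total rate after every mutation can be sketched as follows for the HKY85 case. This is an illustration of the general approach rather than the function from the appendix: the names sub_rates and simulate_hky85, the dictionary of base frequencies and the example sequence are all assumptions. Waiting times are drawn from an exponential distribution whose rate is the current total over all sites, and the site and target nucleotide are then chosen in proportion to their rates.

```python
import random

NUCS = "TCAG"
TRANSITIONS = {("T", "C"), ("C", "T"), ("A", "G"), ("G", "A")}

def sub_rates(nuc, a, b, pi):
    """Rates from nuc to each other nucleotide: a*pi_j or b*pi_j."""
    return {j: (a if (nuc, j) in TRANSITIONS else b) * pi[j]
            for j in NUCS if j != nuc}

def simulate_hky85(seq, a, b, pi, t_end=1.0, rng=random):
    """Mutate seq over the interval 0..t_end; return (new sequence, events)."""
    seq = list(seq)
    exit_rates = [sum(sub_rates(x, a, b, pi).values()) for x in seq]
    t, events = 0.0, 0
    while True:
        total = sum(exit_rates)               # current total rate
        t += rng.expovariate(total)           # waiting time to next mutation
        if t > t_end:
            break
        # Choose the site, then the target nucleotide, in proportion to rate.
        site = rng.choices(range(len(seq)), weights=exit_rates)[0]
        rates = sub_rates(seq[site], a, b, pi)
        seq[site] = rng.choices(list(rates), weights=list(rates.values()))[0]
        # The mutated site now has a different exit rate, so update it.
        exit_rates[site] = sum(sub_rates(seq[site], a, b, pi).values())
        events += 1
    return "".join(seq), events

rng = random.Random(1)
pi = {"T": 0.1, "C": 0.2, "A": 0.3, "G": 0.4}   # hypothetical frequencies
before = "TCAG" * 5
after, n_events = simulate_hky85(before, 0.4, 0.2, pi, rng=rng)
print(before)
print(after, n_events)
```

A TN93 version would differ only in sub_rates, which would use alpha1 for T<->C and alpha2 for A<->G.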
The functions written for the simulation of the mutation of a nucleotide sequence are included in the appendix and are labelled accordingly.

Maximum Likelihood Estimates (MLE) - JC69 & K80 Models

Maximum likelihood estimates are used to estimate parameter values for a statistical model when applying that model to a data set. In the case of nucleotide substitutions, the statistical models fitted to data are the models of nucleotide substitution and the parameter estimated is the product of rate and time. Rate and time are dealt with as a single value as they cannot be distinguished from one another; the single value (at) can be produced by many different combinations of values of alpha and time. The data set used will be two sequences of nucleotides of equal length, of which one sequence will be assumed to have evolved from the other through several mutations. The total length of a sequence is represented by the letter n and the number of differences (the number of nucleotides which differ between the sequences) is represented by the letter k.

JC69

To explain the theory behind acquiring the maximum likelihood estimate, the binomial distribution must be considered. The following is the probability mass function (pmf) of the binomial distribution:
P(k) = C(n, k) * p^k * (1 - p)^(n - k)

C(n, k) = the binomial coefficient, n! / (k! * (n - k)!).
n = the total length of a sequence.
k = the number of differences between the two sequences.

The probability mass function is used to calculate the probability of the observed data when a variable (at) is exactly equal to the value proposed for the variable. For example, if a value for at is inputted into the probability mass function, the value calculated will represent the probability of observing k differences in n sites if that value of at were correct. In replacement of the variable p is the equation used in the transition probability matrix for the JC69 model to calculate the probability of a mutation occurring. The equation used in replacement of 1 - p is the equation from the transition probability matrix of the JC69 model used to calculate the probability of a mutation not occurring. The following equation is the probability mass function, altered to include the variables mentioned above, with the total length of a sequence (n) as 100 and the number of differences (k) as 40. The notation pow(x,y) represents the value x to the power of y:

[Image: the probability mass function l for the JC69 model with n = 100 and k = 40]
The variable m represents the value at. The value of at with the highest probability can be found through trial and error; however, using Python, all values of at within an interval can be tested and plotted onto a graph:
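A search of this kind can be sketched as below. The exact form of the original function is not reproduced, so this is an assumption: the likelihood is taken to be the JC69 probability of a change raised to the power k, multiplied by the probability of no change raised to the power n - k, with the constant binomial coefficient left out because it does not move the peak. The names jc69_likelihood, grid, m_hat and m_closed are illustrative.

```python
import math

def jc69_likelihood(m, n=100, k=40):
    """Likelihood of m (= rate * time) given k differences in n sites."""
    e = math.exp(-4.0 * m)
    p_change = 0.25 - 0.25 * e     # JC69 probability of one specific change
    p_same = 0.25 + 0.75 * e       # JC69 probability of no change
    return pow(p_change, k) * pow(p_same, n - k)

# Test every value of m between 0.001 and 0.4 in steps of 0.001.
grid = [i / 1000.0 for i in range(1, 401)]
m_hat = max(grid, key=jc69_likelihood)
print(round(m_hat, 2))

# Cross-check against the analytic maximum, -ln(1 - 4k/(3n)) / 4.
m_closed = -math.log(1 - 4 * 40 / (3 * 100)) / 4
print(round(m_closed, 4))
```

Using the probability of any change (three times p_change) in place of p_change multiplies the likelihood by a constant only, so the position of the peak is the same under either reading.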
The probability mass function equation displayed above was used to generate the data to plot this graph. The values of m (at) within the interval 0 to 0.4 were tested and a peak probability was acquired. The peak represents the value of m (at) with the highest probability of resulting in the value of k and therefore is the maximum likelihood estimate. In this case, the maximum likelihood estimate is 0.19 for at.

K80

Finding the maximum likelihood estimate using the principles of the K80 model is approached in a very similar way as with the JC69 model. The probability mass function is adjusted so that two values are estimated, as there are two parameters for rates in the K80 model, alpha and beta.

Pmf = p0^(n-k-j) * p1^k * p2^j

Where:
p0 = the equation used from the transition probability matrix of the K80 model to calculate the probability of no mutation occurring.
p1 = the equation used from the transition probability matrix of the K80 model to calculate the probability of a transition mutation occurring.
p2 = the equation used from the transition probability matrix of the K80 model to calculate the probability of a transversion mutation occurring.
n = the total length of a sequence.
k = the number of differences between two sequences that have resulted from transition mutations.
j = the number of differences between two sequences that have resulted from transversion mutations.

[Image: the probability mass function l for the K80 model]
a and b represent the values for the rates of transitions (alpha) and transversions (beta) respectively. Using Python, a table can be generated showing the likelihood of each combination of values of a and b. The values in this table can be plotted graphically using a contour plot. The following is a contour plot generated using the equation for the probability mass function displayed above, where the total length of a sequence (n) is 100, the number of differences that have resulted from transition mutations (k) is 30 and the number of differences that have resulted from transversion mutations (j) is 10:
The lines become concentrated around the maximum likelihood estimates for the values of alpha (rate of transitions) and beta (rate of transversions). The estimate for the most probable value of b is clearly centred on the interval between 0.12 and 0.14. Unfortunately, the value for a is not visible, as the limits of this contour graph do not show where the lines of the graph centre on the y-axis.

Maximum likelihood estimates are used in conjunction with models of nucleotide evolution mainly to estimate the time taken for one sequence of nucleotides to evolve into another, assuming that one sequence is the ancestor of the other. Only a value for at, the product of rate and time, is obtainable directly. However, if an average rate (or rates in the case of multiple parameter models) is known, the variable of time can be separated out by dividing at by the known rate, and so the time taken for one sequence to mutate into the other can be calculated. Practically, biologists and statisticians have adopted this method when attempting to calculate the time taken for particular species (such as humans) to have evolved from ancestral species (such as earlier primates). By assessing the same sections of DNA, of the same length, from the two species, the number of differences may be recorded and used to estimate a time using the maximum likelihood method.

Conclusions

As my investigation was not an experiment as such but rather the translation of statistical models onto software so as to use these models in practical situations, my conclusion would be to state that the programmes that I have written to emulate these statistical models have been successful and so may be applied to practical data sets. This translational link between statistical models and new computing software embodies the basic principles of bioinformatics and allows demonstrations of how statisticians and biologists can therefore use these models when dealing with mutated sequences of DNA.

If I had further research time and possibly slightly more options in terms of computing software, there are multiple areas that I would have expanded within my project and report. First of all, I would have included a step-by-step explanation of the Taylor series expansion, allowing readers to understand the mathematical theory behind obtaining the transition probability matrix from the rate matrix of a nucleotide model. Also, I would have explored further models of nucleotide evolution, as there are many more significant models that have not been mentioned. These models would have broadened the scope of my project and would have depicted further steps by which each model was chronologically improved. The last section of this report, the maximum likelihood estimation of the JC69 and K80 models, could also be progressed further. With access to alternative computing software that could plot multi-dimensional graphs, I would have extended the calculation of maximum likelihood estimates into estimating the parameters for the HKY85 and TN93 models.

References:
Computational Molecular Evolution (Yang 2006)
www.wikipedia.org
www.python.org
http://docs.python.org/lib/module-random.html
http://www.tau.ac.il/~doronadi/F81_model.doc
http://www.megasoftware.net/WebHelp/part_iv___evolutionary_analysis/computing_evolutionary_distances/distance_models/nucleotide_substitution_models/hc_jukes_cantor_distance.htm
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=hmg.figgrp.1080
Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach (Joseph Felsenstein 1981)
A Novel Use of Equilibrium Frequencies in Models of Sequence Evolution (Nick Goldman and Simon Whelan)
Functions

Genseq - the generation of a random sequence of nucleotides is essential to the simulation of nucleotide substitution. To define a function to generate a sequence, a parameter for the length of the sequence must be defined. In this case, n is used. The function randomly chooses a letter, representing each nucleotide, from the list “ACGT” using the in-built ‘randint’ function. The chosen letter is added to a list; the process of choosing a letter is then repeated n times, creating a list or ‘sequence’ n nucleotides long.
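A minimal sketch matching this description is shown below. The optional rng argument is an addition made here so that a seeded generator can be passed in; it is not part of the description above.

```python
import random

def genseq(n, rng=random):
    """Return a list of n nucleotides chosen uniformly from "ACGT"."""
    letters = "ACGT"
    sequence = []
    for _ in range(n):
        # randint is inclusive at both ends, so 0..3 indexes all four letters.
        sequence.append(letters[rng.randint(0, 3)])
    return sequence

seq = genseq(25, random.Random(0))
print("".join(seq))
```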
Timex - this function allows for the generation of a cumulative set of times that represent when mutations will occur in a nucleotide sequence. This function is only used within the simpler models of substitution, as it assumes that transition probabilities are the same for each nucleotide. An in-built function (random.expovariate) takes a value for rate as a parameter and generates another value using this rate value. Inputting a higher rate value will increase the probability of the in-built function generating a smaller value. Values are generated using the same rate value and are accumulated to represent the times at which events occur according to the inputted rate value. This process is terminated when the cumulative time value increases over 1, as we are only interested in mutations occurring between times 0 and 1. This function is effective as, theoretically, if events occur at a higher rate, more events will occur in a given time.
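The behaviour described can be sketched as follows. The parameters t_end and rng are assumptions added for illustration; the description above fixes the end of the interval at 1.

```python
import random

def timex(rate, t_end=1.0, rng=random):
    """Return the cumulative event times that fall within 0..t_end."""
    times = []
    t = rng.expovariate(rate)          # waiting time to the first event
    while t <= t_end:
        times.append(t)
        t += rng.expovariate(rate)     # add the next waiting time
    return times

times = timex(4.5, rng=random.Random(0))
print(len(times), [round(t, 3) for t in times])
```

The number of times returned varies from run to run; with a rate of 4.5 the average is 4.5 events in the interval.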
Intgen - this function was created to generate a list, of length n, of random numbers. These random numbers denote at what points in the sequence mutations will occur. The timex function is initially used to calculate the number of mutations that will occur in an allotted time. The number of calculated mutations will then determine the length of the list of random numbers. Each number within this list refers to the position of a nucleotide in the sequence being mutated. That nucleotide will then be mutated.
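The final step can be sketched as below for the JC69 case, where every other nucleotide is an equally likely target. The signatures of intgen and mutate_jc69, zero-based positions and the example sequence are assumptions made for illustration, not the original definitions.

```python
import random

def intgen(n_events, seq_len, rng=random):
    """Return n_events random positions (0-based) within the sequence."""
    return [rng.randint(0, seq_len - 1) for _ in range(n_events)]

def mutate_jc69(seq, positions, rng=random):
    """Mutate each listed position to a different, equally likely nucleotide."""
    seq = list(seq)
    for pos in positions:
        choices = [x for x in "TCAG" if x != seq[pos]]
        seq[pos] = rng.choice(choices)
    return seq

rng = random.Random(0)
before = list("TCAGTCAGTC")
positions = intgen(7, len(before), rng)    # 7 mutations, as in the example
after = mutate_jc69(before, positions, rng)
print("".join(before))
print("".join(after))
```

Because the same position can be drawn more than once, the number of visible differences can be smaller than the number of mutations, as in the worked example earlier in the report.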