phylip

A primer to phylogenetic analysis usingPhylip packageJarno Tuimala20022Allrightsreserved.ThePDFversionofthisbookorpartsofitcanbeusedinFinnishuniversities as course material, provided that this copyright notice is included. However,thispublicationmaynotbesoldorincludedaspartofotherpublicationswithoutpermission of the publisher.Phylip is freely available from http://evolution.genetics.washington.edu/phylip.html.PHYLIP -- Phylogeny Inference Package (Version 3.2).Felsenstein, J. 1989. Cladistics 5: 164-1663IntroductionPhylip 4User interphase 4Getting startedAlways keep record 5Sequence alignment 5Font Files 6Running Phylip programs 6Essential programs 7Basic analysisDistance methods 8DNA data 9Protein data 13Parsimony methods 14DNA data 14Protein data 18Maximum likelihood methods 19DNA data 19Protein data 22Resampling procedure 22Drawing the tree 26Advanced topicsUser trees 28Estimating transition/transversion ratio 31Estimating base frequencies 32Testing molecular clock 33Inferring ancient states of sequence sites 35Statistical tests of trees 37LogDet-distance 38Computing topological distances between trees 39Weighting 41DNAml, HMM, gamma distribution and rate heterogenity 43RecommendationsPragmatic warnings 45Pitfalls 46Programs 48Data flow charts 49Quick start 504IntroductionThepurposeofthistutorialistodemonstratehowtorunthephylogeneticanalysissoftware,andsomeoftheoptionsthatareavailable.Thistutorialisnotintentedtobeacourseinphylogenetics,althoughsomephylogeneticconceptswillbediscussedbriefly.Thereareotherbooksavailablewhichcoverthethereticalsidesofthephylogeneticanalysis, but the actual data analysis work is less well covered.HerewewillmostlydealwithmolecularsequencedataanalysisinthecurrentPhylipversion 3.6a1.PhylipPhylip is a comprehensive phylogenetic analysis package created by Joseph Felsenstein attheUniversityofWashington.Thispackagecandomanyofthephylogeneticanalysesavailableintheliteraturetoday.Methodsthatareavailableinthepackageincludeparsimony,distancematrix,andlikelihoodmethods,includingbootstrappingandconsensustrees.Datatypesthatcanbehandledincludemolecularsequences,genefrequencies, restriction sites, distance matrices, and 0/1 discrete characters.Phylipisfreelyavailablefromhttp://evolution.genetics.washington.edu/phylip.html.Itships with a comprehensive manual covering the usage of different programs.User interphaseTheprogramsarecontrolledthroughamenu,whichaskstheuserswhichoptionstheywant to set, and allows them to start the computation. The data are read into the programfrom a text file, which the user can prepare using any word processor or text editor (but itisimportantthatthistextfileisnotinthespecialformatofthatwordprocessor--itshouldinsteadbein"flatASCII"or"TextOnly"format).Somesequencealignmentprograms can write data files in the PHYLIP format.Most of the programs look for the data in a file called "infile" -- if they do not find thisfile they then ask the user to type in the file name of the data file. Output is written ontospecialfileswithnameslike"outfile"and"treefile".Treeswrittenonto"treefile"areintheNewickformat,aninformalstandardagreedtoin1986byauthorsofanumberofmajor phylogeny packages.5Getting startedAlways keep recordsIt very important to keep record of lab-procedures you have done, but it is even more sowithcomputeranalyses.Youmighteasilygetconfusedwithmanymanyresultfiles,especially, if you have not given them informative names. And, the computer can crash,orthehard-drivemaycorrupt,andyoucanloseyourwork.Aftersuchinsidentsitiseasier to recover the work you have done, if you have kept a good analysis record.So, always keep records. It indicates you have been working.Sequence alignmentPhylip programs read the aligned sequencies in Phylip-format. Usually you can recognizethefilesinthisformatfromthe.phy-extensionassociatedwiththefiles.Alignedsequencies in the suitable format can be produced with ClustalX program, which is freelyavailablefromhttp://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html.Justbesurethat you save the aligned sequencies in Phylip-format.Ifyouneedtoeditthealignment(withtext-editor)ortodosomeanalysesintheotherprograms,whichdonotreadPhylip-format,savethealignmentalsointhe.aln-format(Clustal-format). The editing of the Clustal alignment format is easier than the editing ofPhylip-format, and Clustal will readily read in the .aln-format, if you later need to convertthe edited sequencies into some other format.Anymultiplesequencealignmentcanalsobemanuallyreformattedwithatexteditor.TheformatrequirementsforPhylipareverystringent,andanydeviationwillresultinaprogram that hangs, usually with the error message "Unable to allocate memory".The file must confirm to the following:1. The file begins with the information about the number of sequencies and the number ofnucleotides or amino acids in the alignment2. The sequence names must be exactly 10 characters long - spaces are allowed in the end3. Gaps must be indicated by -4. Missing data or missing information (no sequence) is indicated by ?. This is especiallyimportantintheendofthedatafile.Gaps(-)intheendofthedatafilemayleadtheprograms to crash.5.Periodwaspreviously,butnolonger,allowed,becauseitsometimesisusedindifferent senses in other programs6. Spaces between the alignment blocks are allowed. This normally makes the alignmentmorereadable.Spacesareusuallyinsertedintothealignmentevery10basesoraminoacids.7.Blankswillbeignored,andsowillnumericaldigits.ThisallowsGENBANKandEMBL sequence entries to be read with minimum editing.6Example of the formatted sequencies: 5 50Rabbit ?????????? ?????????C CAATCTACAC ACGGG-GTAG GGATTACATAHumanAGCCACACCC TAGGGTTGGC CAATCTACTC CCAGGAGCAG GGAGGGCAGGOpossumAGCCACACCC CAACCTTAGC CAATAGACAT CCAGAAGCCC AAAAGGCAAGChickenGCCCGGGGAA GAGGAGGGGC CCGGCGG-AG GCGATAAAAG TGGGGACACAFrog GGATGGAGAA TTAGAGCACT TGTTCTTTTT GCAGAAGCTC AGAATAAACGIf there are ambiguities (such as Y or R nucleotides), these are also handled correctly, anddo not cause trouble.Thepresentversionsoftheprogramsmaysometimeshavedifficultieswiththeblanklines between groups of lines, and if so, you might want to retype those lines, making surethat they have only a carriage-return and no blank characters on them, or you may perhapshavetoeliminatethem.Thesymptomsofthisproblemarethattheprogramscomplainthatthesequencesarenotproperlyaligned,andyoucanfindnoothercauseforthiscomplaint.Font filesInordertobeabletousethetree-drawingtools,thefont-filesneedtobeinthesamefolderasthedrawtreeordrawgramprogram(s).IfyouareusingaPCmachinewherePhylipisinstalled,youshouldnotencounteranytrouble.Therearesixdifferentfontsavailable:font1 simple sans-serif Romanfont2 medium quality sans-serif Romanfont3 high quality sans-serif Romanfont4 medium quality sans-serif Italicfont5 hig quality sans-serif Italicfont6 Russian CyrillicRunning Phylip programsThe programs are used in a sequential way. The output from the first program is used asaninputinthenextprogram.Thetrickistoknowhowtousetheprogramsinsuitablecombinations.Inwindows,thePhylipprogramscanbeinvokedbydouble-clickingontheiconorbytyping the name of the program on the command line.Most Phylip programs run in the same way. The input for a program is taken from a filecalled"infile"andtheresultsarewritteninafilecalled"outfile".Someprogramsmaywrite both "outfile" and a file called "outtree" or "plotfile".Becausemostoftheprogramsusethedefaultnamesfortheinputandoutputfiles,youneed to be sure to rename the files you want to save before proceeding to further analysis.Otherwise you risk losing your results. For example, you get a distance matrix ("outfile")7fromtheprogramDNAdist,butyouwanttotrydifferentmatrixcalculations.Then,beforedoingthematrixcalculationagain,renamethe"outfile"withdnadist_out_F84orsomethingsimilar,sothatyoucantelldifferentanalysisresultsapartafteryouhaveceased working.Essential programsHere is a list of the programs that can be used for the molecular sequence data analysis.The programs are devided into the method categories. The choise of the correct analysismethod is left for the user.Distance methodsTheseprogramsareintentedtobeusedsequentally.FirstadistancematrixiscalculatedbyDNAdistorProtdistprogramfromthemultiplesequencealignment.Thismatrixisthen transformed into a tree by Fitch, Kitch or Neighbor program.DNAdist DNA distance matrix calculationProtdist Protein distance matrix calculationFitch Fitch-Margoliash tree drawing method without molecular clockKitsch Fitch-Margoliash tree drawing method with molecular clockNeighbor Neighbor-Joining and UPGMA tree drawing methodCharacter based methodsThese programs read in the sequence alignment, and produce either one or multiple treesin the output files "outfile" and "outtree".DNApars DNA parsimonyDNApenny DNA parsimony using branch-and-boundDNAml DNA maximum likelihood without molecular clockDNAmlk DNA maximum likelihood with molecular clockProtpars Protein parsimonyProml Protein maximum likelihoodResampling toolThis program reads in a sequence alignment, and generates a specified number of randomsamples. These random samples are usually stored as a new sequence alignment file.Seqboot Generates random samples by bootstrapping or jack-knifingTree drawingThese programs draw a tree from the specifications ("outtree") in the Newick-format. Forexample, the specification can be a file produced by the program DNAml.Drawgram Draws a rooted treeDrawtree Draws an unrooted treeRetree Interactive tree-rearrangement8Consensus treesThis program constructs a consensus tree from multiple trees. For example, DNApars canproducemultipletrees,whichcanbesummarizedbytheprogramConsense.Alsotheresults of the bootstrapping are summarized by the program Consense into a tree.Consense Draws consensus trees from multiple treesTree distancesThisprogramcomputesatopology-baseddistancebetweentwoormoretrees.Thedistance can be used to assess or compare the results from different analyses.Treedist Computes distances between trees based on tree topologyBasic analysisTherearethreedifferentwaystoanalyzeDNAoraminoacidsequencedatainPhylip.Parsimony and maximum likelihood are character based methods, which means that theytreateverysinglesiteofthemultiplesequencealignmentindependently.Distancemethodssummarizethedifferenciesbetweensequenciesbycalculatingapairwisedistance measure between all aligned sequences. After the data-analysis and tree-drawing,the validity of data can be assessed by a bootstrap analysis.Here we present examples of these methods.Theprogramscanbeinvokedbydouble-clickingontheiricons.Ifyouhavedonesomeanalysesbeforewithinthesamefolder,theprogramdetectsthat"outfile"alreadyexists,and it ask you to:dnadist: the file "outfile" that you wanted to use as output file already exists. Do you want to Replace it, Append to it, write to a new File, or Quit? (please type R, A, F, or Q)Replacemeanstooverwrite,Appendaddsthenewresultsintotheoldfile,andFileaskfor a new file name. You can also quit, if you want to check the files before proceeding.Distance methodsForthedistancemethodanalysisyou'llneedatleasttwoprograms.DNAdistorprotdistcalculatesamatrixofpairwisedistancesbetweeneverysequenceinthefile.Thetreeisinferred from those distances by Neighbor, Fitch or Kitsch program.9DNA dataThe DNAdist program writes out a menu:Nucleic acid sequence Distance Matrix program, version 3.6Settings for this run:DDistance (F84, Kimura, Jukes-Cantor, LogDet)?F84T Transition/transversion ratio?2.0G Gamma distribution of rates among sites?NoCOne category of substitution rates?YesW Use weights for sites?NoFUse empirical base frequencies?YesL Form of distance matrix?SquareMAnalyze multiple data sets?NoI Input sequences interleaved?Yes0Terminal type (IBM PC, ANSI, none)?ANSI1 Print out the data at start of runNo2 Print indications of progress of runYesY to accept these or type the letter for one to changeThe settings can be changed by typing in the letter before the option, and pressing Return.For example, typing "d" and return, would cycle through the different distance calculationmethods.Thesedistancesarealsocalledevolutionarymodels.Evolutionarymodelsaremathematicalformulaswhichtrytocompensateforthemultiplesubstitutionproblem.They also take into account the transtion/transversion biases.Briefly, Jukes-Cantor assumes that all substitutions are equally likely to happen. Kimurahastwodifferentchangerates(rateparameters),onefortransitions,andtheotherfortransversions. These models also assume that the equilibrium frequencies of all the basesare 0.25. F84 has two rate parameters, one for transitions, and the other for transversions,butalsoallowestheequilibriumfrequenciesofthebasestodifferfromeachother.LogDet-distanceshouldbeusedifthereare(large)basefrequencydifferenciesbetweenlineages in the tree. The LogDet distance cannot cope with ambiguity codes. It must havecompletely defined sequences.Thetransition/transversionratio(s/v)canbemodifiedifmoredetaileddataisavailableon the s/v ratio. If not, the s/v ratio of 1-2 is often a good approximation of the situationwith most of the genes.Usuallythephylogeneticmethodsassumesthatallthesitesareevolvingatthesamespeed.Thisisclearlyanunrealisticassumption.Forexample,thethirdcodonpositionsevolvewithahigherspeedthansecondpositions.Thatisbecausemostofthesubstitutionsinthethirdpositionareselectivelyneutralwhereassubstitutionsinthesecondcodonpositionoftenleadtoaminoacidchanges.Thisrateheterogenityisoftencompensatedbyusinggammadistribution.Theshapeofthegammadistributionisdefinedbyaparameteralpha.Ifyouactivateoption"g"you'llbepromptedtoenterthisshape parameter:10Coefficientofvariationofsubstitutionrateamongsites(mustbepositive)In gamma distribution parameters, this is 1/(square root of alpha)If you know alpha, you can calculate the prompted CV by the equations:CV = 1 / sqrt(alpha)alpha = (1/CV)^2A good approximation of the alpha for most of the protein encoding genes is 0.5. The useof the gamma distribution is still heavily disputed, and you should be careful when usingthis option.Anotherlayerofratevariationalsoisavailable.Theoption"c"allowsuser-definedratecategories.Theuserispromptedforthenumberofuser-definedrates,andfortheratesthemselves, which cannot be negative but can be zero. These numbers are defined relativeto each other, so that if rates for three categories are set to 1 : 3 : 2.5 this would have thesame meaning as setting them to 2 : 6 : 5. The assignment of rates to sites is then made byreadingafilewhosedefaultnameis"categories".Itshouldcontainastringofdigits1through9.Anewlineorablankcanoccurafteranycharacterinthisstring.Thusthecategories file might look like this:1222311111224111551155333333444Theusercanassigncategoriesofratestoeachsite(forexample,wemightwantfirst,second,andthirdcodonpositionsinaproteincodingsequencetobethreedifferentcategories. This is done with the categories input file and the option "c". We then specify(usingthemenu)therelativeratesofevolutionofsitesinthedifferentcategories.Forexample, we might specify that first, second, and third positions evolve at relative rates of1.0, 0.8, and 2.7.Theweights-option("w")allowsustospecifyweightsontheindividualcharacters.Theweights cause a character to be counted as if it were n characters, where n is the weight.Byuseoftheweightswecangiveweighttosomecharacters,anddropothersfromtheanalysis.Inthemolecularsequenceprogramsonlytwovaluesoftheweights,0or1areallowed. For more information, see the advanced analysis chapter.If you want to specify the base frequencies by yourself, you can do that by invoking theoption "f". The program then prompts you to type in the frequencies of different bases.After you have modified the settings to your liking, you can calculate the distance matrixby typing "y" and pressing return.11This will produce a distance matrix with default name "outfile":5Rabbit0.00000.28180.93860.98301.2617Human 0.28180.00001.07951.14961.5312Opossum 0.93861.07950.00001.53801.8615Chicken 0.98301.14961.53800.00001.5819Frog1.26171.53121.86151.58190.0000The"outfile"needstoberenamed,becausetheprogramswillgoawry,iftheytrybothread and write from and to the same file. After renaming this new file can be used as aninput into the Neighbor, Fitch, or Kitsch program.Neighbor writes out a menu, where you can again change the setting:Neighbor-Joining/UPGMA method version 3.6Settings for this run:N Neighbor-joining or UPGMA tree?Neighbor-joiningOOutgroup root?No, use as outgroup species 1L Lower-triangular data matrix?NoR Upper-triangular data matrix?NoSSubreplicates?NoJ Randomize input order of species?No. Use input orderM Analyze multiple data sets?No0 Terminal type (IBM PC, ANSI, none)?ANSI1Print out the data at start of runNo2Print indications of progress of runYes3Print out treeYes4 Write out trees onto tree file?YesY to accept these or type the letter for one to changeAtthispoint,youcanspecifytheoutgroupofthetree.Outgroupisusuallytheclosestrelativeofallthetaxaunderinvestigation.Forexample,ifhuman,chimpanzeeandgorillaareunderscrutiny,orangutancouldbespecifiedasanoutgroup.Inotherwords,the sister taxa of the ingroup is often used as an outgroup.You can also randomize the input order of the sequencies with option "j", jumble. Quiteoftentheinputorderofthesequenciesaffectstheoutcomeoftheanalysis.Thiscanbeassessedbyrandomizingtheinputorder.Theprogramalsoasksyoutospecifythenumber of times you want to randomize the input order of the sequencies.AlsoFitchwritesoutamenu,wheresettingscanbemodified,butthemenulooksabitdifferent:12Fitch-Margoliash method version 3.6Settings for this run:DMethod (F-M, Minimum Evolution)?Fitch-MargoliashU Search for best tree?YesPPower?2.00000-Negative branch lengths allowed?NoOOutgroup root?No, use as outgroup species 1L Lower-triangular data matrix?NoR Upper-triangular data matrix?NoSSubreplicates?NoGGlobal rearrangements?NoJ Randomize input order of species?No. Use input orderM Analyze multiple data sets?No0 Terminal type (IBM PC, ANSI, none)?ANSI1Print out the data at start of runNo2Print indications of progress of runYes3Print out treeYes4 Write out trees onto tree file?YesY to accept these or type the letter for one to changeFitch-Margoliash is the distance based optimization method, which means that it searchesforatreewiththesmallestvariability.Thereforeyouhavetheoptionofusinganuser-definedtreewithoption"u".Inthiscasetheprogramwill,asdefault,readtheuser-defined tree from the file "intree". This also activates a new option "n". If you choose touse the branch lenghts from the user tree, fitch calculates the Sum of squares and Averagepercent standard deviation for the user-defined tree.Confirmthesettingbytyping"y"andpressingReturn.Neighborcreatestwonewfiles,"outfile" and "outtree".Outfile contains detailed information about the analysis and its results. It also contains theinferred tree drawn with symbol graphics. Also the estimated branch lengths are reportedin the file.Neighbor-Joining/UPGMA method version 3.6 Neighbor-joining method Negative branch lengths allowed+---------------------Opossum+---2! !+------------------Chicken! +----1!+----------------------------Frog!3-Rabbit!+------Humanremember: this is an unrooted tree!13BetweenAndLength---------------- 3Human 0.23064 3 20.12944 2Opossum 0.73871 2 10.17009 1Chicken 0.62698 1Frog0.95492 3Rabbit0.05116File"outtree"containstheactualtree-drawinginstructions.Italsoreportsthebranchlenghts, which are the same as in the file "outfile".(Human:0.22996,((Frog:0.95134,Chicken:0.63056):0.16672,Opossum:0.74182):0.12891,Rabbit:0.05184);Protein dataThedistancemethodworksmoreorlesssimilarlythanwithDNAdata.Theprogramprotdist writes out a menu with modifiable settings. After you have modified them to yourliking,theprogramproducesan"outfile",whichcontainspairwisedistances.Thosedistances can be transformed into a tree as has been described in the DNA data-section.Therearethreeoptionsofevolutionarymodelstochoosefrom:PAM,Kimuraandcategories.PAMmodelusesthePAM001modelforcalculations.Kimuramodelistheplain D-distance (the uncorrected raw distance) and categories is a model put together byFelsenstein. See Phylip documentation for more information.Protein distance algorithm, version 3.6Settings for this run:PUse PAM, Kimura or categories model?Dayhoff PAM matrixGGamma distribution of rates among positions?NoM Analyze multiple data sets?NoIInput sequences interleaved?Yes0 Terminal type (IBM PC, ANSI)?ANSI1Print out the data at start of runNo2Print indications of progress of runYesAre these settings correct? (type Y or the letter for one to change)yComputing distances:CAS1_HUMANCAS1_RABIT .CAS1_MOUSE ..CAS1_BOVIN ...CAS1_SHEEP ....CAS1_PIG .....Output written to output file14Parsimony methodsDNA dataThe program available for DNA parsimony are DNApars and DNApenny.DNApars is controlled through this menu:DNA parsimony algorithm, version 3.6Setting for this run:U Search for best tree?YesSSearch option?More thorough searchVNumber of trees to save?100J Randomize input order of sequences?No. Use input orderOOutgroup root?No, use as outgroup species 1TUse Threshold parsimony?No, use ordinary parsimonyN Use Transversion parsimony?No, count all stepsW Sites weighted?NoM Analyze multiple data sets?NoIInput sequences interleaved?Yes0 Terminal type (IBM PC, ANSI, none)?ANSI1Print out the data at start of runNo2Print indications of progress of runYes3Print out treeYes4Print out steps in each siteNo5Print sequences at all nodes of treeNo6 Write out trees onto tree file?YesY to accept these or type the letter for one to changeInvoking the option "s" will allow you to specify some options of more or less thoroughsearch. You can even deside to do rearrangements for one tree only.Oftenparsimonyanalysisproducesmanyequallyparsimonioustrees,whichcanthenbeanalyzedmorecloselywith,e.g.,maximumlikelihoodorsomeothermethods.Youcanspecify the number of trees to save with option "v".Sometimes it is a good practise to limit the number of changes one site can contribute tothe tree. This can be accomplished with option "t". With "t" you can specify a thresholdover which all the changes are regarded as a threshold number of changes.Transversionparsimony(option"n")usesonlytransversionsfortheparsimonyanalysis.Thistriestoremovethebiasresultingfromthemorerapidaccumulationoftransitionsubstitutions to the DNA.Theoption"5"canbeusedtoinferancientstatesofsites.Ifyouchoosetoprintsequenciesatallnodesofthetree,the"outfile"willcontainmoreinformationthannormally:15FromTo Any Steps?State at upper node . means same as in the node below it on tree1GTYCAGGGCT -GGGCATAAA AGGCAGAGCA GGGCCAGCTR 12 yes...NR..... .?........ .......... A.RT...... 23 yes..C..CB... ..CA.VDGHD T.V...CV.. ........A. 3 Frog yesC..AA.TTTG GC..TGG.TT ..A....A.. T.A..GT..G 3 Chickenyes.A.GG.C... .-...CA.CG ..CT.T.C.C .CGGG....A 2 Opossumyes..TTG...GC CA..G..... .........T ..A.T..T.T 1 HumanyesAGC....... .......... ..T...G... .A....T..A 1 Rabbit yes..T....A.. T......... .......... ....-....G1CTGCTTACAH 12 yes.W.....A.M 23 yes.....C.... 3 Frog yes.T.A....CA 3 ChickenyesGA..C..G.C 2 Opossumyes.A..A.C.TA 1 HumanyesT........T 1 Rabbitmaybe .........CDNApenny has a different kind of menu:Penny algorithm for DNA, version 3.6 branch-and-bound to find all most parsimonious treesSettings for this run:HHow many groups of100 trees:1000FHow often to report, in trees: 100S Branch and bound is simple?YesOOutgroup root?No, use as outgroup species 1TUse Threshold parsimony?No, use ordinary parsimonyM Analyze multiple data sets?NoIInput sequences interleaved?Yes0 Terminal type (IBM PC, ANSI, none)?ANSI1Print out the data at start of runNo2Print indications of progress of runYes3Print out treeYes4Print out steps in each siteNo5Print sequences at all nodes of treeNo6 Write out trees onto tree file?YesAre these settings correct? (type Y or the letter for one to change)You can specify ("h") how many hundreds of trees you want to search, and how often thereport is printed on the screen ("f").Youcanalsosetthebranch-and-boundalgorithmtoreconsidertheorderofthespecieswith option "s".The differences between these DNA parsimony programs are in the algorithm. DNAparssearchesforthemostparsimonioustreebyaheuristicalgorithm,whichdoesnotguaranteethattheshortesttreeisfound.DNApennyusesthebranch-and-boundalgorithm,whichguaranteesthattheshortesttreeisfound,buttakesquiteamuchcomputer time.16DNApars gives only a little information during the run:Adding species: 1. Rabbit 2. Human 3. Opossum 4. Chicken 5. FrogDoing global rearrangements on all trees tied for best!---------! ......... .........Output written to output fileTrees also written onto tree fileDone.DNApenny gives information about the search in real time, so you can estimate how longthe run is going to take:How manytrees looked Approximateat so farLength ofHow many percentage(multiples longest tree trees this longsearchedof100): found so far found so far so far---------- ------------ ------------ ------------Output written to output fileTrees also written onto tree fileBothprogramsproduce"outfile"and"outtree",buttheylookabitdifferent.DNAparsinfersthebranchingorderofthetreeandestimatesthebranchlenghts,butDNApennyonly infers the branching order.Thus, the "outfile" and "outtree" of DNApars:DNA parsimony algorithm, version 3.6One most parsimonious tree found:+--------------------Frog +--------3+----------2+-----------------Chicken|||+----------------Opossum|1----------Human|+------------Rabbit17requires a total of 3019.000betweenand length---------- ------ 1 2 0.181782 2 3 0.150139 3Frog 0.347616 3Chicken0.304468 2Opossum0.291273 1Human0.182778 1Rabbit 0.219167(((Frog:0.34762,Chicken:0.30447):0.15014,Opossum:0.29127):0.18178,Human:0.18278,Rabbit:0.21917);looks different from those produced by DNApenny:Penny algorithm for DNA, version 3.6 branch-and-bound to find all most parsimonious treesrequires a total of 3019.000One most parsimonious tree found:+-----------Rabbit!!+--Frog1 +--3!+--2+--Chicken!!!+--4+-----Opossum ! +--------Humanremember: this is an unrooted tree!(Rabbit,(((Frog,Chicken),Opossum),Human));18Protein dataProtein parsimony program protpars is comparable to DNApars, and its menu looks verysimilar:Protein parsimony algorithm, version 3.6Setting for this run:U Search for best tree?YesJ Randomize input order of sequences?No. Use input orderOOutgroup root?No, use as outgroup species 1TUse Threshold parsimony?No, use ordinary parsimonyC Use which genetic code?UniversalM Analyze multiple data sets?NoIInput sequences interleaved?Yes0 Terminal type (IBM PC, ANSI, none)?ANSI1Print out the data at start of runNo2Print indications of progress of runYes3Print out treeYes4Print out steps in each siteNo5Print sequences at all nodes of treeNo6 Write out trees onto tree file?YesAre these settings correct? (type Y or the letter for one to change)You can specify the genetic code used for the analysis by option "c". You have an optionto choose from universal and four mitochondrial codes.However, its "outfile" and "outtree" are similar to the output of DNApenny:Protein parsimony algorithm, version 3.6One most parsimonious tree found: +-----CAS1_PIG+--5!!+--CAS1_SHEEP +--3+--4 !! +--CAS1_BOVIN+--2!!!+--------CAS1_MOUSE1!!+-----------CAS1_RABIT!+--------------CAS1_HUMANremember: this is an unrooted tree!requires a total of 1061.000((((CAS1_PIG,(CAS1_SHEEP,CAS1_BOVIN)),CAS1_MOUSE),CAS1_RABIT),CAS1_HUMAN);19Maximum likelihood methodsTherearetwomaximumlikelihoodprogramsforDNAdata,andoneforproteindata.DNAmlusesamaximumlikelihoodmethodwithoutmolecularclock,andDNAmlkassumes a molecular clock. Proml does not assume molecular clock.DNA dataThe menu of the DNAml:Nucleic acid sequence Maximum Likelihood method, version 3.6Settings for this run:U Search for best tree?YesTTransition/transversion ratio:2.0000F Use empirical base frequencies?YesCOne category of sites?YesR One region of substitution rates?YesW Sites weighted?NoSSpeedier but rougher analysis?YesGGlobal rearrangements?NoJ Randomize input order of sequences?No. Use input orderOOutgroup root?No, use as outgroup species 1M Analyze multiple data sets?NoIInput sequences interleaved?Yes0 Terminal type (IBM PC, ANSI, none)?ANSI1Print out the data at start of runNo2Print indications of progress of runYes3Print out treeYes4 Write out trees onto tree file?Yes5 Reconstruct hypothetical sequences?NoY to accept these or type the letter for one to changeMost of the options have already been covered in the chapters above. However, there aretwo new options, "s" and "g" with which you can modify the search parameters a bit.Theoption"s"turnsonoroffthesearchmethodwhichiteratesordoesnotiteratethebranch lenghts in all topologies. Turning this option off ("No, not rough") will cause theprogram to run more slowly, but it will also be a bit more likely to find the tree topologyof highest likelihood.The"g"(globalsearch)optioncauses,afterthelastspeciesisaddedtothetree,eachpossible group to be removed and re-added. This improves the result, since the position ofevery species is reconsidered. It approximately triples the run-time of the program.Ifmorethanonecategoryisspecified,thenanotheroption,"a",becomesvisibleinthemenu.Thisallowsustospecifythatwewanttoassumethatsitesthathavethesameregional rate category are expected to be clustered so that there is autocorrelation of rates.The program asks for the value of the average patch length. This is an expected length ofpatches that have the same rate. If it is 1, the rates of successive sites will be independent.If it is, say, 10.25, then the chance of change to a new rate will be 1/10.25 after every site.However the "new rate" is randomly drawn from the mix of rates, and hence could even20bethesame.Sotheactualobservedlengthofpatcheswiththesameratewillbeabitlargerthan10.25.Notebelowthatifyouchoosemultiplepatches,therewillbeanestimate in the output file as to which combination of rate categories contributed most tothe likelihood.The menu of the DNAmlk:Nucleic acid sequence Maximum Likelihood method with molecular clock, version 3.6Settings for this run:U Search for best tree?YesTTransition/transversion ratio:2.0F Use empirical base frequencies?YesC One category of substitution rates?YesR One region of substitution rates?YesGGlobal rearrangements?NoJ Randomize input order of sequences?No. Use input orderM Analyze multiple data sets?NoIInput sequences interleaved?Yes0 Terminal type (IBM PC, ANSI, none)?ANSI1Print out the data at start of runNo2Print indications of progress of runYes3Print out treeYes4 Write out trees onto tree file?YesAre these settings correct? (type Y or the letter for one to change)The"outfile"oftheDNAmlisquitesimilartotheoutfileofDNAmlk.Thefirstisthe"outfile" from DNAml analysis, and the latter is the same file from DNAmlk.Nucleic acid sequence Maximum Likelihood method, version 3.6Empirical Base Frequencies: A 0.25650 C 0.21951 G 0.22091T(U) 0.30309Transition/transversion ratio = 2.000000+-----Human|| +--------------------------Frog| +-----31-----2 +-----------------Chicken| || +--------------------Opossum|+--Rabbitremember: this is an unrooted tree!21Ln Likelihood = -9695.14457 BetweenAndLengthApprox. Confidence Limits ----------------------- ---------- ------1Rabbit0.09743 (0.06975, 0.12511) **1Human 0.20090 (0.16716, 0.23474) **1 20.18632 (0.12823, 0.24451) **2 30.19152 (0.11910, 0.26384) **3Frog0.90555 (0.78325, 1.02776) **3Chicken 0.58309 (0.49455, 0.67166) **2Opossum 0.69201 (0.60518, 0.77894) ** *= significantly positive, P < 0.05 ** = significantly positive, P < 0.01Nucleic acid sequence Maximum Likelihood method with molecular clock, version 3.6Empirical Base Frequencies: A 0.25650 C 0.21951 G 0.22091T(U) 0.30309Transition/transversion ratio = 2.000000+---------------------------------------Frog--4!+--------------------------------Chicken+------3 ! +------------------------Opossum +-------2 ! +-------Human +-----------------1 +-------RabbitLn Likelihood = -9714.14764AncestorNodeNode Height Length ---------------- ------ ------ root4 4Frog0.776260.77626 4 30.138420.13842 3Chicken 0.776260.63785 3 20.278150.13973 2Opossum 0.776260.49812 2 10.626020.34787 1Human 0.776260.15025 1Rabbit0.776260.1502522Asyoucansee,DNAmlproducestheconfidenceintervalsofthebranchlenghtsofthetree, but DNAmlk doesn't. DNAml also produces a rough estimate of the p-value that thelenght of the branch is significantly greater than zero. This test is done by comparing thetree with the inferred branch lenght to a tree where the same branch has been scaled to bezero.Thisresultisonlyanapproximation,sodonotrelyonitwheninterpretingtheresults.Thereportedlikelihoodisactuallyanaturallogarithmofthelikelihood.Thecloserthelikelihoodvalueistozero,thebetter.Itcanbeusedforstatisticaltestingofthehypothesis, as will be discussed in the advanced analysis chapter.The "outtree" from both programs is identical in its form:(Human:0.20090,((Frog:0.90555,Chicken:0.58309):0.19152,Opossum:0.69201):0.18632,Rabbit:0.09743);Protein dataThe startup menu of the program proml is similar to DNAml:Amino acid sequence Maximum Likelihood method, version 3.6Settings for this run:U Search for best tree?YesCOne category of sites?YesR One region of substitution rates?YesW Sites weighted?NoSSpeedier but rougher analysis?YesGGlobal rearrangements?NoJ Randomize input order of sequences?No. Use input orderOOutgroup root?No, use as outgroup species 1M Analyze multiple data sets?NoIInput sequences interleaved?Yes0 Terminal type (IBM PC, ANSI, none)?ANSI1Print out the data at start of runNo2Print indications of progress of runYes3Print out treeYes4 Write out trees onto tree file?Yes5 Reconstruct hypothetical sequences?NoY to accept these or type the letter for one to changeTheprogrampromlassumesnomolecularclock,andproducesoutputfileswhichareidentical to the output files of the DNAml.Resampling procedureResamplingwithreplacement(bootstrapping)canbedonewiththeprogramseqboot.First a random sample is generated from the original sequence alignment:Random number seed (must be odd)?1123Bootstrapping algorithm, version 3.6Settings for this run:D Sequence, Morph, Rest., Gene Freqs?Molecular sequencesJ Bootstrap, Jackknife, or Permute?BootstrapB Block size for block-bootstrapping?1 (regular bootstrap)RHow many replicates?100W Read weights of characters?NoC Read categories of sites?NoFWrite out data sets or just weights?Data setsIInput sequences interleaved?Yes0 Terminal type (IBM PC, ANSI, none)?ANSI1Print out the data at start of runNo2Print indications of progress of runYesY to accept these or type the letter for one to changeYou have several options to choose from.First, there are three resampling procedures available in the seqboot program (option "j").Bootstrappingcreates(withblocksize=1,option"b")newdatasets,whichareofequallenght compared to the original sequence alignment. Jackknife deletes a random 50% ofsites from the original alignment. Permutation produces data matrices that have the samenumberandkindsofcharactersbutnotaxonomicstructure.Itisusedfordifferentpurposes than the bootstrap, as it tests not the variation around an estimated tree but thehypothesis that there is no taxonomic structure in the data: if a statistic such as number ofstepsissignificantlysmallerintheactualdatathanitisinreplicatesthatarepermuted,thenwecanarguethatthereissometaxonomicstructureinthedata(thoughperhapsitmight be just a pair of sibling species).The option "r" sets the number of new random data matrixes produced.Youcanalsouseweightswiththebootstrappingprocedure(option"w").Ifyouwanttosavetheharddiskspace,youcangenerate,insteadofnewdataset,afilecontainingweights. These weights can then be used in the same way as the new bootstrapped data-set.Only,insteadofusingmultipledatasetintheanalysisprogram,youneedtousetheweights (see the advanced analysis chapter).Accepting the settings produces a file where there are 100 random samples:6 333CAS1_MOUSE KLCLAAAFMR RHHHSSNNNN VSSSSQQQQQ QQHHSSE--- --------II IFQQPKYYYLCAS1_RABIT KLCLTTALRK KHHHLL---- HLLLKLLQQE QQPPSSQEEI ILLKKERLRR RFQQTVPPPLCAS1_PIG KFCLVVALRK KPPPLL---- HQQQEHHQQN EEPPSRELLF FKKEERKRFF FPVVPLLLLSCAS1_HUMAN RLCLVVALRK KPPPLL---- YPPPERRQQN PPSSSS---- ---------- ----PIPPPLCAS1_SHEEP KLCLVVALRK KPPPII---- HQQQGLLPP- ------EVVL LNNEEN-RFF FVAAPFPPPECAS1_BOVIN KLCLVVALRK KPPPII---- HQQQGLLQQ- ------EVVL LNNEEN-RFF FFAAPFPPPE6 333CAS1_HUMAN MRRLLLLLCL AAARPKKPLL RYPPPRRLQN PESSSE---- ---------- --PPPPPLEECAS1_SHEEP MKKLLLLLCL AAARPKKPII KHQQQLLSP- ------EEVN N-LLRFFVVV VVPPPPPEVVCAS1_MOUSE MKKLLLLLCL AAAMPRRHSS RVSSSQQTQQ QSSSSEEE-- -----IIFFK KKPPYYYLNNCAS1_PIG MKKLLLFFCL AAARPKKPLL RHQQQHHLQN EDSRREEELR RKFFRFFPPE EEPPLLLSQQCAS1_RABIT MKKLLLLLCL AAARHKKHLL GHLLLLLTQE QESSSEQQEE ERKKLRRFFV VVTTPPPLEECAS1_BOVIN MKKLLLLLCL AAARPKKPII KHQQQLLPQ- ------EEVN N-LLRFFFFV VVPPPPPEVVetc.24Then we proceed with this file as if it was the original. In our case, lets produce distancetrees with the program protdist.Protein distance algorithm, version 3.6Settings for this run:PUse PAM, Kimura or categories model?Dayhoff PAM matrixGGamma distribution of rates among positions?NoM Analyze multiple data sets?NoIInput sequences interleaved?Yes0 Terminal type (IBM PC, ANSI)?ANSI1Print out the data at start of runNo2Print indications of progress of runYesAre these settings correct? (type Y or the letter for one to change)mHow many data sets?100We have to tell the program that there are multiple datasets in the same file by changingthe option m.Thenthedistancematrixiscalculatedforallthe100randomsamples.Theoutfileproduced contains all these distance matrixes:6CAS1_MOUSE 0.000000.806221.460411.149972.119281.98883CAS1_RABIT 0.806220.000001.357720.821521.764011.49899CAS1_PIG 1.460411.357720.000001.300750.611100.58505CAS1_HUMAN 1.149970.821521.300750.000001.724991.56997CAS1_SHEEP 2.119281.764010.611101.724990.000000.08268CAS1_BOVIN 1.988831.498990.585051.569970.082680.000006CAS1_HUMAN 0.000002.030321.019381.276900.694971.95053CAS1_SHEEP 2.030320.000002.152710.718471.790890.16285CAS1_MOUSE 1.019382.152710.000001.476330.902481.87560CAS1_PIG 1.276900.718471.476330.000001.527770.77397CAS1_RABIT 0.694971.790890.902481.527770.000001.54938CAS1_BOVIN 1.950530.162851.875600.773971.549380.00000And so forthThisdistancematrixisthenusedfortreedrawingwiththeoriginalmethod,i.e.,iftheoriginaldatawasanalysedwithF84-distanceandneighbor-joining,thebootstrappinganalysis has to be performed with the same settings.Again,intheprogramneighborthemultipledatasetsoption(m)hastobechosen.Theresulting outtree contains trees for all the 100 random datasets.((CAS1_PIG:0.09668,(CAS1_SHEEP:0.11341,CAS1_BOVIN:-0.03073):0.46006):0.75811,CAS1_HUMAN:0.41697,(CAS1_MOUSE:0.55088,CAS1_RABIT:0.25534):0.16567);(((CAS1_SHEEP:0.14929,CAS1_BOVIN:0.01356):0.52397,CAS1_PIG:0.14082):0.82720,CAS1_MOUSE:0.54493,(CAS1_HUMAN:0.38539,CAS1_RABIT:0.30958):0.06851);And so forth25These datasets are then combined in a consensus tree with the program consense.Majority-rule and strict consensus tree program, version 3.6Settings for this run: C Consensus type (strict, MR, MRe, Ml):Majority rule (extended) OOutgroup root:No, use as outgroup species RTrees to be treated as Rooted:No T Terminal type (IBM PC, ANSI, none):ANSI 1Print out the sets of species:Yes 2 Print indications of progress of run:Yes 3 Print out tree:Yes 4 Write out trees onto tree file:YesAre these settings correct? (type Y or the letter for one to change)There are four consensus tree types to choose from. Strict consensus creates a tree whichonly includes the set of sequencies, if it occurs in all the trees. The MR, MRe and M1 allproduceamajorityruletreeswithslightlydifferentoptions.Thedefaultmethod(MRe)will include into the new tree all the groups of sequencies, which are present in more than50% of the trees. M1 lets you to specify the percentage.Note,thattheconsensustreefrombootstrappingsamplesshouldalwaysbedrawnwithmajority rule method.The "outfile" and "outtree" contain the information of how many times the set of specieswas find to be together in the random samples. If this value is under, say 95%, the resultshouldbeinterpretedwithcausion.Thisalsoimpliesthatthereprobablyisnotenoughdata to differentiate between the species, which have a low bootstrapping value.Majority-rule and strict consensus tree program, version 3.6Species in order:CAS1 PIGCAS1 SHEEPCAS1 BOVINCAS1 HUMANCAS1 MOUSECAS1 RABITSets included in the consensus treeSet (species in order) How many times out of100.00...*** 100.00.**... 100.00....** 77.0026Sets NOT included in consensus tree:Set (species in order) How many times out of100.00...*.* 16.00...**.7.00Extended majority rule consensus treeCONSENSUS TREE:the numbers at the forks indicate the numberof times the group consisting of the specieswhich are to the right of that fork occurredamong the trees, out of 100.00 trees+-------------CAS1 HUMAN +100.0-| ||+------CAS1 RABIT |+-77.0-|+------| +------CAS1 MOUSE|||| +------CAS1 SHEEP|+-------100.0-||+------CAS1 BOVIN|+---------------------------CAS1 PIGremember: this is an unrooted tree!(((CAS1_HUMAN:100.0,(CAS1_RABIT:100.0,CAS1_MOUSE:100.0):77.0):100.0,(CAS1_SHEEP:100.0,CAS1_BOVIN:100.0):100.0):100.0,CAS1_PIG:100.0);The bootstrapping procedure implemented in Phylip does not perform the analysis on theoriginaltree.Instead,Felsensteinhasarguedthatthemostreasonabletreewouldbetheone recoved from the random samples as decribed above. However, the original inferredtree and the tree produced by bootstrapping are usually pretty similar.Drawing the treeNowwecandrawanunrootedtreefromtheinformationcontainedinthefile"outtree"with the program drawgram.Drawgram: can't find input tree file "intree"Please enter a new file name> plottreeDRAWGRAM from PHYLIP version 3.6Reading tree ...Tree has been read.Loading the font ....Drawgram: can't find font file "fontfile"Please enter a new file name> font1Font loaded.27Rooted tree plotting program version 3.6Here are the settings: 0Screen type (IBM PC, ANSI):ANSI P Final plotting device:Postscript printer V Previewing device:X Windows display HTree grows:Horizontally STree style:Phenogram BUse branch lengths:Yes L Angle of labels:90.0 RScale of branch length:Automatically rescaled D Depth/Breadth of tree:0.53 TStem-length/tree-depth:0.05 CCharacter ht / tip space:0.3333 A Ancestral nodes:Weighted FFont:Times-Roman MHorizontal margins:1.65 cm MVertical margins:2.16 cm #Pages per tree:one page per tree Y to accept these or type the letter for one to changeFirst,wewanttobeabletoviewthefileeasilyinWindows,andwechangetheFinalplotting device by typing "p" and Return.A new menu opens:Which plotter or printer will the tree be drawn on?(many other brands or models are compatible with these) type: to choose one compatible with:L Postscript printer file formatM PICT format (for drawing programs)J HP Laserjet PCL file formatW MS-Windows BitmapF FIG 2.0 drawing program formatA Idraw drawing program formatZ VRML Virtual Reality Markup Language fileP PCX file format (for drawing programs)K TeKtronix 4010 graphics terminalX X Bitmap formatV POVRAY 3D rendering program fileR Rayshade 3D rendering program fileH Hewlett-Packard pen plotter (HPGL file format)D DEC ReGIS graphics (VT240 terminal)E Epson MX-80 dot-matrix printerC Prowriter/Imagewriter dot-matrix printerT Toshiba 24-pin dot-matrix printerO Okidata dot-matrix printerB Houston Instruments plotterU other: one you have inserted code for Choose one:FromwhichwewillchooseMS-WindowsBitmapbytyping"w"andReturn.Theprogramthenasksthedesiredresolutionofthepicture(onthenextpage).Afterspecifying that, you'll be dropped back to the main menu.28Please select the MS-Windows bitmap file resolutionX resolution?640Y resolution?400After accepting the settings by typing in "y" and Return, you will get a messageWe now will preview the tree.The box that will beplotted on the screen represents the boundary of thefinal plotting surface.To see the preview, press onthe ENTER or RETURN key (you may need to do it twice).When finished viewing it, press on that key again.Is the tree ready to be plotted? (Answer Y or N)Type Y or N:yWriting plot file ... writing20 lines ...Line Output file size---- ------ ---- ----5 0 10 0 15 0 20 0Done.And the tree is plotted in the "plotfile".The options of the program drawtree are similar, but the program produces a rooted tree.Advanced topicsHeresomemoreadvancedanalysisoptionsareconsidered.Theseoptionsareseldomused by researchers, because they do not know about these possibilites.User treesUsertreeshavemultiplepurposes.Theycanbeusedfortransition/transversionratioapproximation,likelihoodcomparisonsetc.Thesepurposeswillbecoveredinthesucceeding sections.Usertreesshouldbeinaplaintextfilenamed"intree".Thisfileconsistsofthelinegiving the number of trees, succeeded by the trees. Some programs do not read the branchlenghtsorthenumberofthetrees,sothisneedsalittleexperimentation.However,thebasic form of the "intree" file is: 1(Frog,(Chicken,(Opossum,(Human,Rabbit))));29Theeasiestwaytoproduceuser-treesistoperformananalysiswithsomeoftheprograms, and then modify the file outtree with the program retree or by some text editor.Program retreeRetree reads in a treefile, which is in Newick format. For example, a tree produced withDNAml can be directly imported into retree. In this example we modify the tree:(((Frog:0.34762,Chicken:0.30447):0.15014,Opossum:0.29127):0.18178,Human:0.18278,Rabbit:0.21917);when the program starts, you get a menu:Tree Rearrangement, version 3.6Settings for this run:UInitial tree (arbitrary, user, specify)?User tree fromtree fileN Format to write out trees (PHYLIP, Nexus, XML)?PHYLIP0 Graphics type (IBM PC, ANSI)?ANSIW Width of terminal screen, of plotting area?80, 80LNumber of lines on screen?24Are these settings correct? (type Y or the letter for one to change)After you have confirmed these setting, a new view opens:lqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq1:Frog lqqqqqqqq8lqqqqqqqqqqqqqqqqqq7mqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq2:Chickenxxxmqqqqqqqqqqqqqqqqqqqqqqqqqqqqq3:Opossumxqq6qqqqqqqqqqqqqqqqqq4:Humanxmqqqqqqqqqqqqqqqqqqqqqq5:RabbitNEXT? (Options: R . = U W O M T F D B N H J K L C + ? X Q) (? for Help)The options of the program can be printed on the screen by typing "?" and Return: . Redisplay the same tree again = Redisplay the same tree without/with lengths U Undo the most recent change in the tree W Write tree to a file + Read next tree from file (may blow up if none is there) R Rearrange a tree by moving a node or group O select an Outgroup for the tree M Midpoint root the tree T Transpose immediate branches at a node F Flip (rotate) subtree at a node D Delete or restore nodes B Change or specify the length of a branch N Change or specify the name(s) of tip(s)30 H Move viewing window to the left J Move viewing window downward K Move viewing window upward L Move viewing window to the right C show only one Clade (subtree) (might be useful if tree is too big) ? Help (this screen) Q (Quit) Exit from program X Exit from program TO CONTINUE, PRESS ON THE Return OR Enter KEYNowyoucanrerootthetree,swapthebranches,etc.Inthiscase,wewanttoremovebranch lenghts from all the branches of the tree. This can be done by invoking the option"b":NEXT? (Options: R . = U W O M T F D B N H J K L C + ? X Q)(? for Help) bSpecify length of which branch (0 = all branches)? 0 (this operation cannot be undone) enter U to leave the lengths unchangedOR enter R to remove the lengths from all branches:r l>>1:Frogl>>8l>>>>>7m>>2:Chickenx xx m>>>>>3:Opossumxqq6>>>>>>>>>>>4:Humanxm>>>>>>>>>>>5:RabbitNext we want to save the tree and quit the program:NEXT? (Options: R . U W O T F D B N H J K L C + ? X Q) (? for Help) xDo you want to write out the tree to a file? (Y or N) yretree: the file "outtree" that you wanted to use as output tree file already exists. Do you want to Replace it, Append to it, write to a new File, or Quit? (please type R, A, F, or Q)rEnter R if the tree is to be rootedOR enter U if the tree is to be unrooted: uTree written to file(((Frog,Chicken),Opossum),Human,Rabbit);We still need to add the line containing the number of the trees into the produces file: 1(((Frog,Chicken),Opossum),Human,Rabbit);31Estimating transition/transversion ratioEstimatingthetransition/transversionratioisprettyeasywiththeprogramdnapars.Aparsimonytreewithtransversionparsimonyturnedoff(default)isfirstcalculated.Fromthe outfile, note the number of changes required for the tree (in this case, 3019).Fromtheouttree,constructatreefile("intree")whichcontainstheinferredtreewithoutbranchlenghts(forhelp,seeabove).Remembertoaddalinespecifyingthenumberoftrees in the file.Runthednaparswithexactlythesamesettingsaspreviously,exceptgiveitanintree,which we just produced, and use the transversion parsimony. Note the number of changesfrom the outfile (in this case, 2091).Fromthesevalues,thetransition/transversionratiocanbeestimated:3019/2091=1.44.Youcanthenusethisvalueinmoredetailedanalyses,e.g.maximumlikelihood,etc.However, the estimated ratio seems to depend on the specified outgroup. So, if you haveinformation about the outgroup you are planning to use in the forthcoming analysis, youshould use it in this estimation process too.Anotherwaytoestimatethetransition/transversionratioistorunmaximumlikelihoodprogramDNAmlmultipletimes,andtrytofindthevalue,whichminimizesthelikelihoodvalue.Ifyouusethisprocedure,youcanusuallygetquitesimilarresultsaswith the aforementioned parsimony method.The maximum likelihood method works this way:s/t-ratio likelihood1.44 -9652.96081 (initial value from parsimony analysis)1.25 -9644.490781.15 -9642.295361.10 -9641.965611.05 -9642.239881.00 -9643.200090.95 -9644.94070.90 -9647.571380.85 -9651.21944Ifthetreeistheninferredwiththes/t-ratioinferredbyparsimonyandmaximumlikelihood, the tree looks identical in our case, but the branch lenghts vary a bit:32+-----Human||+-----------------------Frog| +----31-----2+---------------Chicken| || +------------------Opossum|+--Rabbitparsimony estimationremember: this is an unrooted tree!Ln Likelihood = -9652.96081 BetweenAndLengthApprox. Confidence Limits ----------------------- ---------- ------1Rabbit0.09923 (0.07218, 0.12617) **1Human 0.19303 (0.16064, 0.22542) **1 20.18455 (0.13216, 0.23698) **2 30.15719 (0.09442, 0.21997) **3Frog0.80197 (0.70133, 0.90253) **3Chicken 0.53674 (0.46077, 0.61276) **2Opossum 0.61599 (0.54185, 0.69013) ** *= significantly positive, P < 0.05 ** = significantly positive, P < 0.01maximum likelihood estimationLn Likelihood = -9641.96561 BetweenAndLengthApprox. Confidence Limits ----------------------- ---------- ------1Rabbit0.10097 (0.07432, 0.12755) **1Human 0.18761 (0.15605, 0.21917) **1 20.18557 (0.13596, 0.23522) **2 30.13641 (0.07800, 0.19475) **3Frog0.75496 (0.66344, 0.84638) **3Chicken 0.51613 (0.44578, 0.58657) **2Opossum 0.57699 (0.50869, 0.64529) ** *= significantly positive, P < 0.05 ** = significantly positive, P < 0.01Estimating base frequenciesAbove,amaximumlikelihoodmethodforestimatingtransition/transversionratioispresented. A similar method can be used for inferring the maximum likelihood estimatorsof base frequencies.Empirical frequencies are: A 0.25650 C 0.21951 G 0.22091T(U) 0.3030933We can start the estimation by option "f", which then prompts for base frequencies. Thefrequency values should be separated by blanks:Base frequencies for A, C, G, T/U (use blanks to separate)?0.25 0.22 0.22 0.31Notethelikelihoodoftheold(-9695.14457)andnew(-9696.03008)analysis.Inthisexample,thelikelihoodgotloweraftermodifyingthebasefrequencies.Thisindicatesthattheempiricalfrequencieswerebetter.Thisprocedurecanberepeatedanumberoftimes in order to get the best estimate.Testing molecular clockThemolecularclockassumptioncanbecrudelytestedbythetwomaximumlikelihoodprograms,DNAmlandDNAmlk.Thetest,whichcomparestwolikelihoodsiscalledlikelihood ratio test.FirsttheanalysisisrunbyprogramDNAmlk,whichproducesanunrootedtreewithmolecular clock assumption (outtree):(Frog:0.77626,(Chicken:0.63785,(Opossum:0.49812,(Human:0.15025,Rabbit:0.15025):0.34787):0.13973):0.13842);Note the probability of the tree (from outfile): -9714.14764.TheprogramDNAmlcanreadilyreadinthistreefile,whichhasfirsttobenamed"intree".Remembertoruntheanalysiswiththeusertreeoption("u")turnedon.Youshould also run the analysis without using the lenghts on the user tree.Note the probabilty of tree from the DNAml analysis: -9695.14457Nowcalculatethedifferencebetweenprobabilities:(-9714.14764(-9695.14457)=19.003. Thisdifferenceisthencompared to the chi-square distribution with df equaltothe number of sequencies. The table consists of dfs on the left side, and p-values on theupper side. The values in the crossing of the p-value and df is called the critical value ofthedistribution.P-valueistheriskthatweconcludebychancethatsequenciesdidnotevolve according to a clock. Often the p-value is set to be 0.05.Ifthedifferenceislargerthanthecriticalvaluementionedinthetablebelow,thesequencies did not evolve according to a molecular clock.34In our case, the difference was 19.003. That is larger than the critical value with df=5 andp-value=0.05,whichis11.070.Thus,weconcludethatwecanrejecttheclockhypothesis,butbychancewemightdothatin1/20casesifthesameanalysiswasrepeated (p-value of 0.05). Probability of exceeding the critical valuedf0.100.05 0.0250.01 0.00112.706 3.841 5.024 6.63510.82824.605 5.991 7.378 9.21013.81636.251 7.815 9.34811.34516.26647.779 9.48811.14313.27718.46759.23611.07012.83315.08620.5156 10.64512.59214.44916.81222.4587 12.01714.06716.01318.47524.3228 13.36215.50717.53520.09026.1259 14.68416.91919.02321.66627.877 10 15.98718.30720.48323.20929.588 11 17.27519.67521.92024.72531.264 12 18.54921.02623.33726.21732.910 13 19.81222.36224.73627.68834.528 14 21.06423.68526.11929.14136.123 15 22.30724.99627.48830.57837.697 16 23.54226.29628.84532.00039.252 17 24.76927.58730.19133.40940.790 18 25.98928.86931.52634.80542.312 19 27.20430.14432.85236.19143.820 20 28.41231.41034.17037.56645.315 21 29.61532.67135.47938.93246.797 22 30.81333.92436.78140.28948.268 23 32.00735.17238.07641.63849.728 24 33.19636.41539.36442.98051.179 25 34.38237.65240.64644.31452.620 26 35.56338.88541.92345.64254.052 27 36.74140.11343.19546.96355.476 28 37.91641.33744.46148.27856.892 29 39.08742.55745.72249.58858.301 30 40.25643.77346.97950.89259.703 31 41.42244.98548.23252.19161.098 32 42.58546.19449.48053.48662.487 33 43.74547.40050.72554.77663.870 34 44.90348.60251.96656.06165.247 35 46.05949.80253.20357.34266.619 36 47.21250.99854.43758.61967.985 37 48.36352.19255.66859.89369.347 38 49.51353.38456.89661.16270.703 39 50.66054.57258.12062.42872.055 40 51.80555.75859.34263.69173.402 41 52.94956.94260.56164.95074.745 42 54.09058.12461.77766.20676.084 43 55.23059.30462.99067.45977.419 44 56.36960.48164.20168.71078.750 45 57.50561.65665.41069.95780.077 46 58.64162.83066.61771.20181.400 47 59.77464.00167.82172.44382.720 48 60.90765.17169.02373.68384.037 49 62.03866.33970.22274.91985.351 50 63.16767.50571.42076.15486.661 51 64.29568.66972.61677.38687.96835 52 65.42269.83273.81078.61689.272 53 66.54870.99375.00279.84390.573 54 67.67372.15376.19281.06991.872 55 68.79673.31177.38082.29293.168 56 69.91974.46878.56783.51394.461 57 71.04075.62479.75284.73395.751 58 72.16076.77880.93685.95097.039 59 73.27977.93182.11787.16698.324 60 74.39779.08283.29888.37999.607 61 75.51480.23284.47689.591 100.888 62 76.63081.38185.65490.802 102.166 63 77.74582.52986.83092.010 103.442 64 78.86083.67588.00493.217 104.716 65 79.97384.82189.17794.422 105.988 66 81.08585.96590.34995.626 107.258 67 82.19787.10891.51996.828 108.526 68 83.30888.25092.68998.028 109.791 69 84.41889.39193.85699.228 111.055 70 85.52790.53195.023 100.425 112.317 71 86.63591.67096.189 101.621 113.577 72 87.74392.80897.353 102.816 114.835 73 88.85093.94598.516 104.010 116.092 74 89.95695.08199.678 105.202 117.346 75 91.06196.217 100.839 106.393 118.599 76 92.16697.351 101.999 107.583 119.850 77 93.27098.484 103.158 108.771 121.100 78 94.37499.617 104.316 109.958 122.348 79 95.476 100.749 105.473 111.144 123.594 80 96.578 101.879 106.629 112.329 124.839 81 97.680 103.010 107.783 113.512 126.083 82 98.780 104.139 108.937 114.695 127.324 83 99.880 105.267 110.090 115.876 128.565 84100.980 106.395 111.242 117.057 129.804 85102.079 107.522 112.393 118.236 131.041 86103.177 108.648 113.544 119.414 132.277 87104.275 109.773 114.693 120.591 133.512 88105.372 110.898 115.841 121.767 134.746 89106.469 112.022 116.989 122.942 135.978 90107.565 113.145 118.136 124.116 137.208 91108.661 114.268 119.282 125.289 138.438 92109.756 115.390 120.427 126.462 139.666 93110.850 116.511 121.571 127.633 140.893 94111.944 117.632 122.715 128.803 142.119 95113.038 118.752 123.858 129.973 143.344 96114.131 119.871 125.000 131.141 144.567 97115.223 120.990 126.141 132.309 145.789 98116.315 122.108 127.282 133.476 147.010 99117.407 123.225 128.422 134.642 148.230100118.498 124.342 129.561 135.807 149.449Inferring ancient states of sequence sitesWhyareweinterestedinreconstructingancestralcharacterstates?Itcanprovideimportant information about adaptive radiations and key innovations for these radiations.Oneinterestingapproachhasbeentousetheancientsequenciestoproduceanancientprotein. Then the biochemical properties of this protein could be studied and compared tothe modern proteins.36The ancestral states of sequence sites can be inferred either by parsimony or by maximumlikelihood. It is commonly thought, that the maximum likelihood method is more accuratethan the parsimony method for inferring ancient sequencies.TheancestralstatesareinferredinbothprogramsDNAparsandDNAmlbyturningontheoption"5",printsequenciesatallnodesoftreeandreconstructhypotheticalsequencies, respectively.Tree is identical for both methods: +--------------------Frog +---------------------3+--2 +-----------------Chicken|||+Human|1---------Opossum|+Rabbitmaximum likelihood methodProbable sequences at interior nodes:nodeReconstructed sequence (caps if > 0.95)1GTTCAGGACT TGGGCATAAA AGGCAGAGCA GGGCcAGCTG CTGCTTACAC2agcCAGGgCT tGGGCATAAA AGtCAGgGCA GaGCcAtCTa tTGCTTACAt3rrcrrcyrcy rscacrrgyr ygycarcrcr rrgyyakcrr yygyycarry FrogCTCAACTTTG GCCATGGGTT TGACAGCACA TGATCGTCAG CTGATCAACA Chicken GACGGCCGCT rrCACCAGCG TGCTATCCCC ACGGGAGCAA GAGCCCAGAC Human AGCCAGGGCT rGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT Opossum GTTTGGGGGC CAGGGATAAA AGGCAGAGCT AGATTAGTTT CAGCATCATA RabbitGTTCAGGACT TGGGCATAAA AGGCAGAGCA GGGCrAGCTG CTGCTTACACMaximumlikelihoodmethodwritesthesequencesiteswithover95%probabilitywithupper case, sites with 50-95% probability with lower case, and those with less than 50%probability with an ambiquity code.parsimony methodFromTo Any Steps?State at upper node1GTTCAGGGCT ?GGGCATAAA AGGCAGAGCA RGGY?AGCTD 12 yesGTCCAGGGCT -GGGCATAAA AGNCAGVGCA RGGYCAKCTR 23 yesGTCVACBGCT -?CACVDGHD TGNCAGCVCA DGGBCAKCAR 3 Frog yesCTCAACTTTG GCCATGGGTT TGACAGCACA TGATCGTCAG 3 ChickenyesGACGGCCGCT --CACCAGCG TGCTATCCCC ACGGGAGCAA 2 HumanyesAGCCAGGGCT -GGGCATAAA AGTCAGGGCA GAGCCATCTA 1 OpossumyesGTTTGGGGGC CAGGGATAAA AGGCAGAGCT AGATTAGTTT 1 Rabbit yesGTTCAGGACT TGGGCATAAA AGGCAGAGCA GGGC-AGCTG371CTGCTTAMAM 12 no CTGCTTAMAM 23 yesCTGCTCAVAM 3 Frog yesCTGATCAACA 3 ChickenyesGAGCCCAGAC 2 HumanyesTTGCTTACAT 1 OpossumyesCAGCATCATA 1 Rabbitmaybe CTGCTTACACIftheparsimonyinferredstateis"?"oranambiquitycode,therearemultipleequallyparsimonious states; the user has to work these out by hand. In addition, "?" means that adeletion might or might not have happened.Statistical tests of treesStatistical tests can be performed for both parsimony and maximum likelihood trees. Thetestsareperformedbyputtingmultipletreesinthe"intree"file.Actually,thetestsareautomatically performed, if there are multiple trees in the "intree" file: 5(((Frog,Chicken),Human),Opossum,Rabbit);(((Frog,Human),Chicken),Opossum,Rabbit);(((Frog,Opossum),Human),Chicken,Rabbit);(((Frog,Chicken),Rabbit),Opossum,Human);(((Frog,Rabbit),Human),Opossum,Chicken);Parsimony program DNApars performs a modified Templeton test. It is closely parallel toa test using log likelihood differences, and uses the mean and variance of step differencesbetweentrees,takenacrosssites.Ifthemeanismorethan1.96standarddeviationsdifferent then the trees are declared significantly different. The program prints out a tableofthestepsforeachtree,thedifferencesofeachfromthebestone,thevarianceofthatquantityasdeterminedbythestepdifferencesatindividualsites,andaconclusionastowhether that tree is or is not significantly worse than the best one.This test finds the best tree among the competing hypothesis. However, the results are notcorrectedformultiplecomparisons,andtheresultsneedstilltobeinterpretedwithcausion.TreeSteps Diff Steps Its S.D. Significantly worse?13134.038.0 18.2807 Yes23194.098.0 15.3665 Yes33096.0 font1Font loaded.Unrooted tree plotting program version 3.6Here are the settings: 0Screen type (IBM PC, ANSI)?ANSI P Final plotting device:MS-Windows Bitmap V Previewing device:X Windows display BUse branch lengths:Yes L Angle of labels:branch points to Middle of label RRotation of tree:90.0 A Angle of arc for tree:360.0 I Iterate to improve tree:Equal-Daylight algorithm DTry to avoid label overlap?No SScale of branch length:Automatically rescaled C Relative character height:0.3333 FFont:Times-Roman MHorizontal margins:1.65 cm MVertical margins:2.16 cm # Page size submenu:one page per tree Y to accept these or type the letter for one to changeyWe now will preview the tree.The box that will beplotted on the screen represents the boundary of thefinal plotting surface.To see the preview, press onthe ENTER or RETURN key (you may need to do it twice).When finished viewing it, press on that key again.Is the tree ready to be plotted? (Answer Y or N)Type Y or N:YWriting plot file ...Done.Note above that the Final plotting device was changed to be MS-Windows Bitmap.53GlossaryAmbiquity 6Ancient states 14, 35Average percent standard deviation 12Base frequencies 10Base frequency heterogenity 38Branch-and-bound 15Bootstrapping 23Categories 13Chi-square distribution 33Consense 25Df 33Distance matrix 10DNAdist 9DNAml 19DNAmlk 20DNApars 14DNApenny 15Drawgram 26Drawtree 28F84 9Estimated branch lenghts 12Fitch 11Fitch-Margoliash 11Gamma corrected distances 44Gamma distribution 9, 44Genetic code 18Global rearrangements 19Global search 19Heuristic algorithm 15Hidden Markov Model (HMM) 44How many replicates 23Jackknifing 23Jukes-Cantor 9Jumble 11Kimura 9, 13Kishino-Hasegawa test 38Likelihood ratio test 33LogDet 9, 38Majority rule 25Multiple comparisons 37Multiple datasets 24Multiple outgroups 44Neighbor 1154Newick format 4Number of trees to save 14Outgroup 11PAM 13Parsimony informative positions 42Permutation 23Protdist 13Proml 22P-value 33Rate categories 10Retree 29Rooted tree 28Saturation of substitutions 42Search 14Seqboot 22Sequence alignment 5Clustal-format, .aln 5phylip-format, .phy 5Speedier but rougher analysis 19Strict consensus 25Sum of squares 12Taxonomical relationships 40Templeton test 37Threshold parsimony 14Topological distances 39Transition / transversion ratio 9, 30Tranversion parsimony 14Treedist 39Unrooted tree 26User defined tree 12User interphase 4Weigths 10, 40

phylip

Documents

dna data

protein data

data files

data flow charts

actual data analysis

running phylip programs

essential programs

text file