bls 303 l1.phylogenetics

BLS 303: Principles of Computational Biology

Lecture 1: Molecular Phylogenetics

Topics

• i. Molecular Evolution
• ii. Calculating Distances
• iii. Clustering Algorithms
• iv. Cladistic Methods
• v. Computer Software

Evolution

• The theory of evolution is the foundation upon which all of modern biology is built.

• From anatomy to behavior to genomics, the scientific method requires an appreciation of changes in organisms over time.

• It is impossible to evaluate relationships among gene sequences without taking into consideration the way these sequences have been modified over time.

Relationships

Similarity searches and multiple alignments of sequences naturally lead to the question:

“How are these sequences related?”

and more generally:

“How are the organisms from which these sequences come related?”

Taxonomy

• The study of the relationships between groups of organisms is called taxonomy, an ancient and venerable branch of classical biology.

• Taxonomy is the art of classifying things into groups (a quintessential human behavior), established as a mainstream scientific field by Carolus Linnaeus (1707-1778).


Phylogenetics

• Evolutionary theory states that groups of similar organisms are descended from a common ancestor.

• Phylogenetic systematics (cladistics) is a method of taxonomic classification that groups organisms according to their evolutionary history.

• It was developed by Willi Hennig, a German entomologist, in 1950.


Cladistic Methods

• Evolutionary relationships are documented by creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.

• Cladistic methods construct a tree (cladogram) by considering the various possible pathways of evolution and choosing from among these the best possible tree.

• A phylogram is a tree with branch lengths that are proportional to evolutionary distances.


Molecular Evolution

• Phylogenetics often makes use of numerical data (numerical taxonomy), which can be scores for various “character states”, such as the size of a visible structure, or DNA sequences.

• Similarities and differences between organisms can be coded as a set of characters, each with two or more alternative character states.

• In an alignment of DNA sequences, each position is a separate character, with four possible character states, the four nucleotides.

DNA is a good tool for taxonomy

DNA sequences have many advantages over classical types of taxonomic characters:
– Character states can be scored unambiguously
– Large numbers of characters can be scored for each individual
– Information on both the extent and the nature of divergence between sequences is available (nucleotide substitutions, insertions/deletions, or genome rearrangements)


A   aat tcg ctt cta gga atc tgc cta atc ctg
B   ... ..a ..g ..a .t. ... ... t.. ... ..a
C   ... ..a ..c ..c ... ..t ... ... ... t.a
D   ... ..a ..a ..g ..g ..t ... t.t ..t t..

Each nucleotide difference is a character
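The dot notation in this alignment can be expanded mechanically. The following is a minimal sketch, assuming that a "." means "same base as sequence A at this position"; the function names are illustrative and not part of the lecture.

```python
# Rows copied from the lecture's alignment; "." is assumed to mean
# "same base as sequence A at this position".
ALIGNMENT = {
    "A": "aat tcg ctt cta gga atc tgc cta atc ctg",
    "B": "... ..a ..g ..a .t. ... ... t.. ... ..a",
    "C": "... ..a ..c ..c ... ..t ... ... ... t.a",
    "D": "... ..a ..a ..g ..g ..t ... t.t ..t t..",
}


def expand(row: str, reference: str) -> str:
    """Replace each '.' with the reference base at the same position."""
    row, reference = row.replace(" ", ""), reference.replace(" ", "")
    if len(row) != len(reference):
        raise ValueError("row and reference must have the same length")
    return "".join(ref if base == "." else base for base, ref in zip(row, reference))


def differing_sites(seq: str, reference: str) -> list[int]:
    """Return the 1-based positions where seq differs from the reference."""
    return [i + 1 for i, (a, b) in enumerate(zip(seq, reference)) if a != b]


reference = ALIGNMENT["A"].replace(" ", "")
sequences = {name: expand(row, reference) for name, row in ALIGNMENT.items()}

for name, seq in sequences.items():
    print(name, seq, "differences from A:", len(differing_sites(seq, reference)))
```

Each position returned by `differing_sites` is one character in the sense used above, and the base found there is its character state.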

Sequences Reflect Relationships

• After working with sequences for a while, one develops an intuitive understanding that, for a given gene, closely related organisms have similar sequences and more distantly related organisms have more dissimilar sequences. These differences can be quantified.

• Given a set of gene sequences, it should be possible to reconstruct the evolutionary relationships among genes and among organisms.

What Sequences to Study?

• Different sequences accumulate changes at different rates, so choose a level of variation that is appropriate to the group of organisms being studied.
– Proteins (or protein-coding DNAs) are constrained by natural selection, which makes them better for very distant relationships
– Some sequences are highly variable (rRNA spacer regions, immunoglobulin genes), while others are highly conserved (actin, rRNA coding regions)
– Different regions within a single gene can evolve at different rates (conserved vs. variable domains)


[Figure: an ancestral gene A (globin) is duplicated to give genes A and B (hemoglobin and myoglobin); speciation then yields the copies A1, B1 and A2, B2 in two species (mouse and human).]

Orthologs vs. Paralogs

• When comparing gene sequences, it is important to distinguish between identical and merely similar genes in different organisms.

• Orthologs are homologous genes in different species that diverged through speciation; they usually have analogous functions.

• Paralogs are similar genes that are the result of a gene duplication.
– A phylogeny that includes both orthologs and paralogs is likely to be incorrect.
– Sometimes phylogenetic analysis is the best way to determine if a new gene is an ortholog or paralog to other known genes.


Terminologies of phylogeny

• Phylogenetic (binary) tree: A tree is a graph composed of nodes and branches, in which any two nodes are connected by a unique path.

• Nodes: Nodes in phylogenetic trees are called taxonomic units (TUs). Usually, taxonomic units are represented by sequences (DNA or RNA nucleotides or amino acids).

• Branches: Branches in phylogenetic trees indicate descent/ancestry relationships among the TUs.

• Terminal (external) nodes: The terminal nodes are also called the external nodes, leaves, or tips of the tree, and are also known as extant taxonomic units or operational taxonomic units (OTUs).

• Internal nodes: The internal nodes are the nodes that are not terminal. They are also called ancestral TUs.

• Root: The root is a node from which a unique path leads to any other node, in the direction of evolutionary time. The root is the common ancestor of all TUs under study.

• Topology: The topology is the branching pattern of a tree.

• Branch length: The lengths of the branches determine the metrics of a tree. In phylogenetic trees, lengths of branches are measured in units of evolutionary time.
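These terms map naturally onto a small data structure. The sketch below is one possible representation, not the one used by any particular package; the class name `Node`, its fields, and the use of Newick notation for output are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One taxonomic unit (TU). A node without children is a terminal node (OTU)."""
    name: str = ""
    branch_length: float = 0.0            # length of the branch leading to this node
    children: list["Node"] = field(default_factory=list)

    def is_terminal(self) -> bool:
        return not self.children

    def terminals(self) -> list[str]:
        """Names of all terminal nodes (leaves, tips) below this node."""
        if self.is_terminal():
            return [self.name]
        return [name for child in self.children for name in child.terminals()]

    def count_internal(self) -> int:
        """Number of internal (ancestral) nodes at or below this node."""
        if self.is_terminal():
            return 0
        return 1 + sum(child.count_internal() for child in self.children)

    def to_newick(self) -> str:
        """Topology and branch lengths in Newick notation (without the final ';')."""
        label = self.name
        if self.children:
            label = "(" + ",".join(c.to_newick() for c in self.children) + ")" + label
        return f"{label}:{self.branch_length:g}"


# The root is simply the node that has no parent.
root = Node(children=[
    Node(branch_length=0.5, children=[Node("A", 1.0), Node("B", 1.0)]),
    Node("C", 1.5),
])
print(root.terminals())         # terminal nodes
print(root.count_internal())    # internal nodes, root included
print(root.to_newick() + ";")   # topology plus branch lengths
```

The nesting of the parentheses is the topology; the numbers after the colons are the branch lengths.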

Example of phylogenetic tree: VP1 gene for FMDV

Genes vs. Species

• Relationships calculated from sequence data represent the relationships between genes; these are not necessarily the same as the relationships between species.

• Your sequence data may not have the same phylogenetic history as the species from which they were isolated.

• Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (hybridization, vector mediated DNA movement, or direct uptake of DNA).

Cladistic vs. Phenetic

Within the field of taxonomy there are two different methods and philosophies of building phylogenetic trees: cladistic and phenetic
– Phenetic methods construct trees (phenograms) by considering the current states of characters without regard to the evolutionary history that brought the species to their current phenotypes.
– Cladistic methods rely on assumptions about ancestral relationships as well as on current data.

Phenetic Methods

• Computer algorithms based on the phenetic model rely on Distance Methods to build trees from sequence data.

• Phenetic methods count each base of sequence difference equally, so a single event that creates a large change in sequence (insertion/deletion or recombination) will move two sequences far apart on the final tree.

• Phenetic approaches generally lead to faster algorithms and they often have nicer statistical properties for molecular data.

• The phenetic approach is popular with molecular evolutionists because it relies heavily on objective character data (such as sequences) and it requires relatively few assumptions.


Cladistic Methods

• For character data about the physical traits of organisms (such as morphology of organs etc.) and for deeper levels of taxonomy, the cladistic approach is almost certainly superior.

• Cladistic methods are often difficult to implement with molecular data because all of the assumptions are generally not satisfied.

Distance Measurements

• It is often useful to measure the genetic distance between two species, between two populations, or even between two individuals.

• The entire concept of numerical taxonomy is based on computing phylogenies from a table of distances.

• In the case of sequence data, pairwise distances must be calculated between all sequences that will be used to build the tree, thus creating a distance matrix.

• Distance methods give a single measurement of the amount of evolutionary change between two sequences since divergence from a common ancestor.
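As an illustration of building a distance matrix, here is a minimal sketch that counts the differing positions between every pair of the four example sequences A to D shown earlier in the lecture (with the dots written out as bases). The function names are hypothetical, and a plain difference count is only the simplest possible distance.

```python
from itertools import combinations

# Sequences A-D from the lecture's example alignment, dots written out as bases.
SEQUENCES = {
    "A": "aat tcg ctt cta gga atc tgc cta atc ctg",
    "B": "aat tca ctg cta gta atc tgc tta atc cta",
    "C": "aat tca ctc ctc gga att tgc cta atc tta",
    "D": "aat tca cta ctg ggg att tgc ttt att ttg",
}
SEQUENCES = {name: seq.replace(" ", "") for name, seq in SEQUENCES.items()}


def count_differences(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: dict[str, str]) -> dict[str, dict[str, int]]:
    """Symmetric matrix of pairwise difference counts, with zeros on the diagonal."""
    names = list(seqs)
    matrix = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = count_differences(seqs[a], seqs[b])
    return matrix


matrix = distance_matrix(SEQUENCES)
print("  " + "".join(f"{name:>3}" for name in matrix))
for name, row in matrix.items():
    print(f"{name} " + "".join(f"{d:>3}" for d in row.values()))
```

For n sequences the matrix holds n(n-1)/2 distinct pairwise distances, here six.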

DNA Distances

• Distances between pairs of DNA sequences are relatively simple to compute as the sum of all base pair differences between the two sequences.
– this type of algorithm can only work for pairs of sequences that are similar enough to be aligned

• Generally all base changes are considered equal

• Insertion/deletions are generally given a larger weight than replacements (gap penalties).

• It is also possible to correct for multiple substitutions at a single site, which is common in distant relationships and for rapidly evolving sites.
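The correction for multiple substitutions can be made concrete with the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The lecture does not name a specific correction, so the choice of Jukes-Cantor here is an assumption; the sketch also simply skips gap positions rather than weighting them, which is a simplification.

```python
import math


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Estimated substitutions per site, assuming all base changes are equally likely."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTTCGTAA")   # hypothetical toy sequences
print(f"observed p = {p:.3f}, corrected d = {jukes_cantor(p):.3f}")
```

The corrected distance is always at least as large as p, and the gap widens as sequences become more divergent, which is where hidden multiple substitutions matter most.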

Amino Acid Distances

• Distances between amino acid sequences are a bit more complicated to calculate.

• Some amino acids can replace one another with relatively little effect on the structure and function of the final protein, while other replacements can be functionally devastating.

• From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation while others require two or even three changes in the DNA sequence.

• In practice, what has been done is to calculate tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks, e.g. PAM and BLOSUM

The PAM 250 scoring matrix

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Dayhoff, M, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol 5, sup. 3, pp 345-352. M. Dayhoff ed., National Biomedical Research Foundation, Silver Spring, MD.
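To show how such a table is used, the sketch below scores two aligned peptides by summing the matrix entry for each aligned pair of residues. Only a handful of entries from the PAM 250 table above are included, and the peptides and function names are hypothetical.

```python
# A few entries copied from the PAM 250 table above. The matrix is symmetric,
# so each pair is stored once with its two letters in alphabetical order.
PAM250_SUBSET = {
    ("W", "W"): 17, ("I", "I"): 5, ("K", "K"): 5, ("D", "D"): 4,
    ("I", "V"): 4, ("K", "R"): 3, ("D", "E"): 3, ("C", "W"): -8,
}


def pair_score(a: str, b: str) -> int:
    """Look up the score for one aligned pair of amino acids."""
    key = tuple(sorted((a, b)))
    if key not in PAM250_SUBSET:
        raise KeyError(f"no score stored for {a}/{b} in this subset")
    return PAM250_SUBSET[key]


def alignment_score(seq1: str, seq2: str) -> int:
    """Sum of pair scores over an ungapped alignment of equal-length peptides."""
    if len(seq1) != len(seq2):
        raise ValueError("peptides must be aligned to the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("WIKD", "WIKD"))  # identical peptides
print(alignment_score("WIKD", "WVRE"))  # three conservative replacements
print(alignment_score("WIKD", "CIKD"))  # one replacement the matrix penalizes heavily
```

Three conservative replacements lower the score only slightly, while the single W to C replacement lowers it far more, which is the behavior the bullets above describe.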

Clustering Algorithms

Clustering algorithms use distances to calculate phylogenetic trees. These trees are based solely on the relative numbers of similarities and differences between a set of sequences.

– Start with a matrix of pairwise distances

– Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa.

UPGMA

• The simplest of the distance methods is the UPGMA (Unweighted Pair Group Method using Arithmetic averages)

• The PHYLIP programs DNADIST and PROTDIST calculate absolute pairwise distances between a group of sequences. Then the GCG program GROWTREE uses UPGMA to build a tree.

• Many multiple alignment programs such as PILEUP use a variant of UPGMA to create a dendrogram of DNA sequences which is then used to guide the multiple alignment algorithm.
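The UPGMA procedure can be sketched in a few lines. This is an illustrative implementation, not the code used by GROWTREE or PILEUP; the function name, the Newick-style output, and the input layout (a list of names plus a symmetric list-of-lists matrix) are assumptions. The distances are the pairwise difference counts for the lecture's example sequences A to D.

```python
def upgma(names: list[str], matrix: list[list[float]]) -> str:
    """Cluster taxa by UPGMA and return the tree as a Newick-style string."""
    # cluster id -> (label, number of leaves, height of the cluster's node)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)        # least distant pair of clusters
        height = dist[i, j] / 2               # new node sits halfway between them
        label_i, size_i, height_i = clusters.pop(i)
        label_j, size_j, height_j = clusters.pop(j)
        merged = {}
        for k in clusters:                    # size-weighted average to every other cluster
            d_ik = dist[min(i, k), max(i, k)]
            d_jk = dist[min(j, k), max(j, k)]
            merged[k, next_id] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(merged)
        label = f"({label_i}:{height - height_i:g},{label_j}:{height - height_j:g})"
        clusters[next_id] = (label, size_i + size_j, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


NAMES = ["A", "B", "C", "D"]
DISTANCES = [
    [0, 5, 6, 9],
    [5, 0, 6, 9],
    [6, 6, 0, 7],
    [9, 9, 7, 0],
]
print(upgma(NAMES, DISTANCES))
```

With these distances A and B are joined first, C is added next, and D joins last. Because every leaf ends up at the same height, the method implicitly assumes an equal rate of change on every branch.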

Neighbor Joining

• The Neighbor Joining method is the most popular way to build trees from distance measurements (Saitou and Nei 1987, Mol. Biol. Evol. 4:406)

– Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.

– The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).

– Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989))
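The adjustment of the distance matrix can be illustrated with the first step of Neighbor Joining. In the usual formulation the pair to join minimizes Q(i,j) = (n-2)·d(i,j) - r(i) - r(j), where r(i) is the sum of the distances from taxon i to all other taxa. The sketch below performs only this first step; the five-taxon matrix is illustrative data, not taken from the lecture, and the function names are hypothetical.

```python
from itertools import combinations

NAMES = ["a", "b", "c", "d", "e"]
ROWS = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
D = {name: dict(zip(NAMES, row)) for name, row in zip(NAMES, ROWS)}


def row_sums(d):
    """r(i): total distance from taxon i to every other taxon."""
    return {i: sum(dist for k, dist in d[i].items() if k != i) for i in d}


def q_matrix(d):
    """Rate-adjusted criterion Q(i,j) = (n-2)*d(i,j) - r(i) - r(j) for each pair."""
    n, r = len(d), row_sums(d)
    return {(i, j): (n - 2) * d[i][j] - r[i] - r[j] for i, j in combinations(d, 2)}


def pair_to_join(d):
    """The pair with the smallest Q value is joined first."""
    q = q_matrix(d)
    return min(q, key=q.get)


def limb_lengths(i, j, d):
    """Branch lengths from i and j to the new internal node that joins them."""
    n, r = len(d), row_sums(d)
    length_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    return length_i, d[i][j] - length_i


i, j = pair_to_join(D)
print("join:", i, j, "branch lengths:", limb_lengths(i, j, D))
```

In this matrix the smallest raw distance is between d and e, yet the adjusted criterion joins a and b first, and the two new branches get different lengths. Both effects come from dropping the equal-rate assumption.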

Cladistic methods

• Cladistic methods are based on the assumption that a set of sequences evolved from a common ancestor by a process of mutation and selection without mixing (hybridization or other horizontal gene transfers).

• These methods work best if a specific tree, or at least an ancestral sequence, is already known so that comparisons can be made between a finite number of alternate trees rather than calculating all possible trees for a given set of sequences.
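To see why calculating all possible trees quickly becomes impractical, the sketch below counts the strictly bifurcating tree topologies for n labeled sequences, using the standard results (2n-5)!! for unrooted trees and (2n-3)!! for rooted trees. The function names are illustrative.

```python
def odd_double_factorial(k: int) -> int:
    """Product of the odd numbers 1 * 3 * 5 * ... * k (returns 1 when k < 1)."""
    result = 1
    for value in range(3, k + 1, 2):
        result *= value
    return result


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating topologies for n >= 3 labeled taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return odd_double_factorial(2 * n - 5)


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating topologies for n >= 2 labeled taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return odd_double_factorial(2 * n - 3)


for n in (4, 6, 10, 20):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Four sequences give only 3 unrooted and 15 rooted trees, but ten sequences already give more than two million unrooted trees.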

Parsimony

• Parsimony is the most popular method for reconstructing ancestral relationships.

• Parsimony allows the use of all known evolutionary information in building a tree

– In contrast, distance methods compress all of the differences between pairs of sequences into a single number

Building Trees with Parsimony

• Parsimony involves evaluating all possible trees and giving each a score based on the number of evolutionary changes that are needed to explain the observed data.

• The best tree is the one that requires the fewest base changes for all sequences to derive from a common ancestor.
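Scoring a single candidate tree can be sketched with Fitch's algorithm, a standard way of counting the minimum number of base changes on a fixed tree; the lecture does not name a specific algorithm, so its use here is an assumption. Trees are written as nested pairs whose leaves are the sequences themselves, and the four short sequences ATCG, TTCG, ATCC and TCCG serve as toy data.

```python
def fitch_site(tree, site: int):
    """Return (possible ancestral bases, changes needed) for one alignment column.

    A tree is either a sequence string (a leaf) or a pair of subtrees.
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left, right = tree
    left_bases, left_changes = fitch_site(left, site)
    right_bases, right_changes = fitch_site(right, site)
    shared = left_bases & right_bases
    if shared:                                   # subtrees agree: no extra change
        return shared, left_changes + right_changes
    # subtrees disagree: one change is needed on a branch below this node
    return left_bases | right_bases, left_changes + right_changes + 1


def parsimony_score(tree, length: int) -> int:
    """Total number of base changes over all columns of the alignment."""
    return sum(fitch_site(tree, site)[1] for site in range(length))


TREE_X = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
TREE_Y = (("ATCG", "TTCG"), ("ATCC", "TCCG"))

print(parsimony_score(TREE_X, 4))
print(parsimony_score(TREE_Y, 4))
```

The tree with the lower score is preferred. A complete parsimony search repeats this scoring for every candidate topology, which is why the number of possible trees matters so much.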

Parsimony Example

• Consider four sequences: ATCG, TTCG, ATCC, and TCCG

• Imagine a tree that branches at the first position, grouping ATCG and ATCC on one branch, TTCG and TCCG on the other branch.

• Then each branch splits, for a total of 3 nodes on the tree (Tree #1)

[Figures: Tree #1 and Tree #2]

Compare Tree #1 with one that first divides ATCC onto its own branch, then splits off ATCG, and finally divides TTCG from TCCG (Tree #2).

Trees #1 and #2 both have three nodes, but when all of the distances back to the root (# of nodes crossed) are summed, the total is equal to 8 for Tree #1 and 9 for Tree #2.
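The two totals can be reproduced with a short sketch. It assumes the trees are written as nested tuples with the sequences as leaves, and that "nodes crossed" counts every internal node on the path from a leaf back to the root, the root included; the function name is hypothetical.

```python
def nodes_crossed_total(tree, depth: int = 0) -> int:
    """Sum, over all leaves, of the number of internal nodes between leaf and root."""
    if isinstance(tree, str):          # a leaf: report how many nodes lie above it
        return depth
    return sum(nodes_crossed_total(child, depth + 1) for child in tree)


TREE_1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
TREE_2 = ("ATCC", ("ATCG", ("TTCG", "TCCG")))

print(nodes_crossed_total(TREE_1))  # 2 + 2 + 2 + 2
print(nodes_crossed_total(TREE_2))  # 1 + 2 + 3 + 3
```

This count depends only on the shape of the tree, not on the bases in the sequences.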

Maximum Likelihood

• The method of Maximum Likelihood attempts to reconstruct a phylogeny using an explicit model of evolution.

• This method works best when it is used to test (or improve) an existing tree.

• Even with simple models of evolutionary change, the computational task is enormous, making this the slowest of all phylogenetic methods.

Assumptions for Maximum Likelihood

• The model must specify the frequencies of DNA transitions (C<->T, A<->G) and transversions (C or T<->A or G).

• The assumptions for protein sequence changes are taken from the PAM matrix - and are quite likely to be violated in “real” data.

• Since each nucleotide site is assumed to evolve independently, the likelihood of the tree is calculated separately for each site. The product of the likelihoods for each site provides the overall likelihood of the observed data.
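The product over sites can be illustrated on the smallest possible "tree": two sequences separated by a distance d. The sketch assumes the Jukes-Cantor model, which treats all base changes alike and is therefore simpler than a model with separate transition and transversion frequencies. The sequences, the grid search, and the function name are illustrative only.

```python
import math


def jc_log_likelihood(a: str, b: str, d: float) -> float:
    """Log-likelihood of two aligned sequences d substitutions per site apart."""
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay       # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * decay       # probability of one specific different base
    total = 0.0
    for x, y in zip(a, b):
        # 0.25 is the assumed frequency of base x. Sites are independent, so the
        # per-site likelihoods multiply, which means their logarithms add.
        total += math.log(0.25 * (p_same if x == y else p_diff))
    return total


seq1 = "A" * 30                        # hypothetical data: 30 sites, 5 differences
seq2 = "A" * 25 + "C" * 5

candidates = [k / 1000 for k in range(1, 2001)]
best_d = max(candidates, key=lambda d: jc_log_likelihood(seq1, seq2, d))
print(f"maximum likelihood distance: {best_d:.3f}")
```

Working with the sum of logarithms instead of the raw product avoids numerical underflow when there are many sites. For a real tree the same idea applies, but each site's likelihood must also be summed over the unknown bases at the internal nodes, which is where most of the computational cost comes from.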

Computer Software for Phylogenetics

Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.
– PHYLIP is a free package that includes 30 programs that compute various phylogenetic algorithms on different kinds of data.
– The GCG package (available at most research institutions) contains a full set of programs for phylogenetic analysis, including simple distance-based clustering and the complex cladistic analysis program PAUP (Phylogenetic Analysis Using Parsimony)
– CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining.
– DNAStar
– MacClade is a well designed cladistics program that allows the user to explore possible trees for a data set.

Phylogenetics on the Web

• There are several phylogenetics servers available on the Web
– some of these will change or disappear in the near future
– these programs can be very slow, so keep your sample sets small

• The Institut Pasteur, Paris has a PHYLIP server at:
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html

• Louxin Zhang at the Natl. University of Singapore has a WebPhylip server:
http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/

• The Belozersky Institute at Moscow State University has their own "GeneBee" phylogenetics server:
http://www.genebee.msu.su/services/phtree_reduced.html

• The Phylodendron website is a tree drawing program with a nice user interface and a lot of options; however, the output is limited to gifs at 72 dpi, which is not publication quality.
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html

Other Web Resources

• Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of Phylogeny programs at:
http://evolution.genetics.washington.edu/phylip/software.html

• Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists
http://www.science.uts.edu.au/sasb/WestonCrisp.html

• University of California, Berkeley Museum of Paleontology (UCMP)
http://www.ucmp.berkeley.edu/clad/clad4.html

Software Hazards

• There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences)

• Moving sequences into different programs can be a major hassle due to incompatible file formats.

• Just because a program can perform a given computation on a set of data does not mean that it is the appropriate algorithm for that type of data.
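As a small example of the file format problem, the sketch below converts aligned sequences from FASTA format into the simple sequential PHYLIP layout (a header line with the number of sequences and sites, then one line per sequence with the name padded to 10 characters). It is a minimal sketch under those assumptions and ignores the many format variants that real programs accept; the function names are hypothetical.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Read FASTA text into a dict of name -> sequence (first word of each header)."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("FASTA header without a name")
            name = words[0]
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name] += line
    return records


def to_phylip(records: dict[str, str]) -> str:
    """Write aligned sequences in sequential PHYLIP layout."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("PHYLIP input needs aligned sequences of equal length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records.items():
        # Names are cut or padded to exactly 10 characters; note that cutting
        # long names can make two different names identical.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


fasta = ">seqA first example\nACGT\nAC\n>seqB\nACGTAA\n"
print(to_phylip(parse_fasta(fasta)), end="")
```

Checking the converted file by eye before running a long analysis is worthwhile, because a silently truncated name or a misaligned sequence will not necessarily cause an error.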

Conclusions

Given the huge variety of methods for computing phylogenies, how can the biologist determine what is the best method for analyzing a given data set?
– Published papers that address phylogenetic issues generally make use of several different algorithms and data sets in order to support their conclusions.
– In some cases different methods of analysis can work synergistically
  • Neighbor Joining methods generally produce just one tree, which can help to validate a tree built with the parsimony or maximum likelihood method
– Using several alternate methods can give an indication of the robustness of a given conclusion.
