Molecular Phylogeny and Evolution
December 15, 2008
Bioinformatics
J. Pevsner
pevsner@kennedykrieger.org
Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by J Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by Wiley.
These images and materials may not be used without permission from the publisher.
Visit http://www.bioinfbook.org
Copyright notice
Five kingdom system (Haeckel, 1879)
Figure labels: plants, animals, monera, fungi, protists, protozoa, invertebrates, vertebrates, mammals
Page 396
Introduction to evolution and phylogeny
Nomenclature of trees
Five stages of molecular phylogeny:
[1] selecting sequences
[2] multiple sequence alignment
[3] models of substitution
[4] tree-building
[5] tree evaluation
Goals of the lecture
Charles Darwin’s 1859 book (On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) introduced the theory of evolution.
To Darwin, the struggle for existence induces a natural selection. Offspring are dissimilar from their parents (that is, variability exists), and individuals that are more fit for a given environment are selected for. In this way, over long periods of time, species evolve. Groups of organisms change over time so that descendants differ structurally and functionally from their ancestors.
Introduction
Page 357
At the molecular level, evolution is a process of mutation with selection.
Molecular evolution is the study of changes in genes and proteins throughout different branches of the tree of life.
Phylogeny is the inference of evolutionary relationships. Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are also used for phylogenetic analyses.
Introduction
Page 358
Studies of molecular evolution began with the first sequencing of proteins, beginning in the 1950s.
In 1953 Frederick Sanger and colleagues determined the primary amino acid sequence of insulin.
(The accession number of human insulin is NP_000198)
Historical background
Page 358
Fig. 11.1, page 359
Mature insulin consists of an A chain and B chain heterodimer connected by disulfide bridges.
The signal peptide and C peptide are cleaved, and their sequences display fewer functional constraints.
Fig. 11.1, page 359
Note the sequence divergence in the disulfide loop region of the A chain
By the 1950s, it became clear that amino acid substitutions occur nonrandomly. For example, Sanger and colleagues noted that most amino acid changes in the insulin A chain are restricted to a disulfide loop region. Such differences are called “neutral” changes (Kimura, 1968; Jukes and Cantor, 1969).
Subsequent studies at the DNA level showed that the rate of nucleotide (and of amino acid) substitution is about six- to ten-fold higher in the C peptide, relative to the A and B chains.
Historical background: insulin
Page 358
Fig. 11.1, page 359
Number of nucleotide substitutions/site/year (values labeled on the figure): 0.1 × 10⁻⁹, 0.1 × 10⁻⁹, 1 × 10⁻⁹
Surprisingly, insulin from the guinea pig (and from the related coypu) evolves seven times faster than insulin from other species. Why?
The answer is that guinea pig and coypu insulin do not bind two zinc ions, while insulin molecules from most other species do. There was a relaxation of the structural constraints on these molecules, and so the genes diverged rapidly.
Historical background: insulin
Page 360
Fig. 11.1, page 359
Guinea pig and coypu insulin have undergone an extremely rapid rate of evolutionary change
Arrows indicate positions at which guinea pig insulin (A chain and B chain) differs from both human and mouse
In the 1960s, sequence data were accumulated for small, abundant proteins such as globins, cytochromes c, and fibrinopeptides. Some proteins appeared to evolve slowly, while others evolved rapidly.
Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock:
For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages
Molecular clock hypothesis
Page 360
As an example, Richard Dickerson (1971) plotted data from three protein families: cytochrome c, hemoglobin, and fibrinopeptides.
The x-axis shows the divergence times of the species, estimated from paleontological data. The y-axis shows m, the corrected number of amino acid changes per 100 residues.
n is the observed number of amino acid changes per 100 residues, and it is corrected to m to account for changes that occur but are not observed.
Molecular clock hypothesis
Page 360
$$\frac{n}{100} = 1 - e^{-m/100}$$
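The correction can be inverted to obtain m directly from an observed n. The following is a minimal sketch, assuming the relation n/100 = 1 − e^(−m/100) stated on this slide; the function name `corrected_changes` is hypothetical and chosen only for illustration.

```python
import math


def corrected_changes(n: float) -> float:
    """Return m (corrected changes per 100 residues) from observed n per 100.

    Rearranges n/100 = 1 - exp(-m/100) into m = -100 * ln(1 - n/100).
    """
    if not 0 <= n < 100:
        raise ValueError("n must satisfy 0 <= n < 100")
    return -100.0 * math.log(1.0 - n / 100.0)


# The correction grows as sequences saturate with changes.
for observed in (5, 20, 50, 80):
    print(observed, round(corrected_changes(observed), 1))
```

For small n the corrected value stays close to the observed value, while for strongly diverged sequences m becomes much larger than n, which is the reason for applying the correction.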
Fig. 11.3, page 361: corrected amino acid changes per 100 residues (m) plotted against millions of years since divergence. Dickerson (1971)
Dickerson drew the following conclusions:
• For each protein, the data lie on a straight line. Thus, the rate of amino acid substitution has remained constant for each protein.
• The average rate of change differs for each protein. The time for a 1% change to occur between two lines of evolution is 20 MY (cytochrome c), 5.8 MY (hemoglobin), and 1.1 MY (fibrinopeptides).
• The observed variations in rate of change reflect functional constraints imposed by natural selection.
Molecular clock hypothesis: conclusions
Page 361
If protein sequences evolve at constant rates, they can be used to estimate the times that species diverged. This is analogous to dating geological specimens by radioactive decay.
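As a rough illustration of this dating idea, the sketch below multiplies a corrected divergence m (percent change) by the time needed for a 1% change, using the three values Dickerson reported (20, 5.8, and 1.1 MY). It assumes a strictly linear clock; the names `MY_PER_PERCENT_CHANGE` and `estimate_divergence_my` are hypothetical.

```python
# Time (millions of years) for a 1% change to arise between two lineages,
# taken from Dickerson's conclusions quoted in this lecture.
MY_PER_PERCENT_CHANGE = {
    "cytochrome c": 20.0,
    "hemoglobin": 5.8,
    "fibrinopeptides": 1.1,
}


def estimate_divergence_my(m: float, protein: str) -> float:
    """Estimate divergence time in MY from m, assuming a constant rate."""
    if m < 0:
        raise ValueError("m must be non-negative")
    return m * MY_PER_PERCENT_CHANGE[protein]


# The same 10% corrected divergence implies very different times
# depending on how fast the protein family evolves.
for name in MY_PER_PERCENT_CHANGE:
    print(name, round(estimate_divergence_my(10.0, name), 1))
```

The estimate is only as good as the assumption of rate constancy; the guinea pig insulin example earlier in the lecture shows how that assumption can fail.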
Molecular clock hypothesis: implications
Page 362
Darwin’s theory of evolution suggests that, at the phenotypic level, traits in a population that enhance survival are selected for, while traits that reduce fitness are selected against. For example, among a group of giraffes millions of years in the past, those with longer necks were able to reach higher foliage and were more reproductively successful than their shorter-necked group members. That is, the taller giraffes were selected for.
In the mid-20th century, a conventional view was that molecular sequences are routinely subject to positive (or negative) selection.
Positive and negative selection
Positive selection occurs when a sequence undergoes significantly increased rates of substitution, while negative selection occurs when a sequence undergoes change slowly. Otherwise, selection is neutral.
Positive and negative selection
Tajima’s relative rate test in MEGA
Tajima’s relative rate test
An often-held view of evolution is that just as organisms propagate through natural selection, so also DNA and protein molecules are selected for.
According to Motoo Kimura’s 1968 neutral theory of molecular evolution, the vast majority of DNA changes are not selected for in a Darwinian sense. The main cause of evolutionary change is random drift of mutant alleles that are selectively neutral (or nearly neutral). Positive Darwinian selection does occur, but it has a limited role.
As an example, the divergent C peptide of insulin changes according to the neutral mutation rate.
Neutral theory of evolution
Page 363
Phylogeny can answer questions such as:
• How many genes are related to my favorite gene?
• Was the extinct quagga more like a zebra or a horse?
• Was Darwin correct that humans are closest to chimps and gorillas?
• How related are whales, dolphins & porpoises to cows?
• Where and when did HIV originate?
• What is the history of life on earth?
Goals of molecular phylogeny
Was the quagga (now extinct) more like a zebra or a horse?
There are two main kinds of information inherent to any tree: topology and branch lengths.
We will now describe the parts of a tree.
Molecular phylogeny: nomenclature of trees
Page 366
Figure: a phylogenetic tree drawn two ways, with node labels A through I, a time axis, numeric branch lengths, and a scale bar of one unit.
Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.
Fig. 11.4, page 366
Tree nomenclature
taxon
taxon
Fig. 11.4, page 366
Tree nomenclature
taxon
operational taxonomic unit (OTU) such as a protein sequence
Fig. 11.4, page 366
Tree nomenclature
branch (edge)
node (intersection or terminating point of two or more branches)
Fig. 11.4, page 366
Tree nomenclature
Branches are unscaled: OTUs are neatly aligned, and nodes reflect time.
Branches are scaled: branch lengths are proportional to the number of amino acid changes.
Fig. 11.4, page 366
Tree nomenclature
bifurcating internal node
multifurcating internal node
Fig. 11.5, page 367
Examples of multifurcation: failure to resolve the branching order of some metazoans and protostomes
Rokas A. et al., Animal Evolution and the Molecular Signature of Radiations Compressed in Time, Science 310:1933 (2005), Fig. 1.
Tree nomenclature: clades
Clade ABF (monophyletic group)
Fig. 11.4, page 366
Tree nomenclature
Clade CDH
Fig. 11.4, page 366
Tree nomenclature
Clade ABF/CDH/G
Fig. 11.4, page 366
Examples of clades
Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10
The root of a phylogenetic tree represents the common ancestor of the sequences. Some trees are unrooted, and thus do not specify the common ancestor.
A tree can be rooted using an outgroup (that is, a taxon known to be distantly related to all other OTUs).
Tree roots
Page 368
Tree nomenclature: roots
Rooted tree (specifies evolutionary path; drawn from past to present)
Unrooted tree
Fig. 11.6, page 368
Tree nomenclature: outgroup rooting
Rooted tree (drawn from past to present)
Outgroup (used to place the root)
Fig. 11.6, page 368
Cavalli-Sforza and Edwards (1967) derived the number of possible unrooted trees ($N_U$) for $n$ OTUs ($n \geq 3$):
$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$
The number of bifurcating rooted trees ($N_R$) is:
$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
For 10 OTUs (e.g. 10 DNA or protein sequences), the number of possible rooted trees is 34 million, and the number of unrooted trees is 2 million. Many tree-making algorithms can exhaustively examine every possible tree for up to ten to twelve sequences.
Enumerating trees
Page 368
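The two counting formulas are easy to evaluate directly. Below is a minimal sketch in Python; the function names `num_unrooted_trees` and `num_rooted_trees` are hypothetical, and exact integer arithmetic is used so that large values are not rounded.

```python
from math import factorial


def num_unrooted_trees(n: int) -> int:
    """Number of bifurcating unrooted trees: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 3:
        raise ValueError("the unrooted formula needs n >= 3")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def num_rooted_trees(n: int) -> int:
    """Number of bifurcating rooted trees: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("the rooted formula needs n >= 2")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


# Print the counts for a few numbers of OTUs.
for n_otus in (3, 4, 5, 10, 20):
    print(n_otus, num_rooted_trees(n_otus), num_unrooted_trees(n_otus))
```

For 10 OTUs this gives 34,459,425 rooted and 2,027,025 unrooted trees, matching the "34 million" and "2 million" figures quoted on the slide. The rapid growth is why exhaustive search becomes impractical beyond roughly a dozen sequences.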
Numbers of trees

| Number of OTUs | Number of rooted trees | Number of unrooted trees |
|---|---|---|
| 2 | 1 | 1 |
| 3 | 3 | 1 |
| 4 | 15 | 3 |
| 5 | 105 | 15 |
| 10 | 34,459,425 | 2,027,025 |
| 20 | 8 × 10²¹ | 2 × 10²⁰ |

Box 11-2, page 369
Molecular evolutionary studies can be complicated by the fact that both species and genes evolve. Speciation usually occurs when a species becomes reproductively isolated. In a species tree, each internal node represents a speciation event.
Genes (and proteins) may duplicate or otherwise evolve before or after any given speciation event. The topology of a gene (or protein) based tree may differ from the topology of a species tree.
Species trees versus gene/protein trees
Page 370
Species trees versus gene/protein trees
Fig. 11.9, page 372: a tree drawn from past to present, with labels for species 1, species 2, the speciation event, gene duplication events, and OTUs.
For some phylogenetic studies, it may be preferable to use protein instead of DNA sequences.
We saw that in pairwise alignment and in BLAST searching, protein is often more informative than DNA (Chapter 3). Proteins have 20 states (amino acids) instead of only four for DNA, so there is a stronger phylogenetic signal.
Stage 1: Use of DNA, RNA, or protein
Page 371
For phylogeny, DNA can be more informative.
--The protein-coding portion of DNA has synonymous and nonsynonymous substitutions. Thus, some DNA changes do not have corresponding protein changes.
Stage 1: Use of DNA, RNA, or protein
Page 371
Fig. 11.10, page 373
If the synonymous substitution rate (dS) is greater than the nonsynonymous substitution rate (dN), the DNA sequence is under negative (purifying) selection. This limits change in the sequence (e.g. insulin A chain).
If dS < dN, positive selection occurs. For example, a duplicated gene may evolve rapidly to assume new functions.
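The comparison of dS and dN can be written as a small decision rule. This is only a sketch of the rule as stated on the slide: it assumes that dN and dS have already been estimated elsewhere (for example by a tool such as SNAP), it ignores statistical significance, and the function name `classify_selection` is hypothetical.

```python
def classify_selection(dn: float, ds: float) -> str:
    """Label the selection regime from nonsynonymous (dN) and synonymous (dS) rates."""
    if dn < 0 or ds < 0:
        raise ValueError("substitution rates must be non-negative")
    if ds > dn:
        return "negative (purifying) selection"
    if ds < dn:
        return "positive selection"
    return "neutral"


# Illustrative (made-up) rate pairs, not measurements.
print(classify_selection(dn=0.02, ds=0.20))
print(classify_selection(dn=0.30, ds=0.10))
print(classify_selection(dn=0.10, ds=0.10))
```

In practice the two rates are estimates with uncertainty, so a real analysis would test whether the difference is significant rather than compare the raw values.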
Stage 1: Use of DNA, RNA, or protein
Page 372
You can measure the synonymous and nonsynonymous substitution rates by pasting your FASTA-formatted sequences into the SNAP program at the Los Alamos National Laboratory HIV database (hiv-web.lanl.gov/).
Stage 1: Use of DNA, RNA, or protein
For phylogeny, DNA can be more informative.
--Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions, sequential substitutions, coincidental substitutions.
Stage 1: Use of DNA, RNA, or protein
Page 372
Fig. 11.11, page 374
Additional mutational events can be inferred by analysis of ancestral sequences. These changes include parallel substitutions, convergent substitutions, and back substitutions.
Stage 1: Use of DNA, RNA, or protein
Page 372
For phylogeny, DNA can be more informative.
--Noncoding regions (such as 5’ and 3’ untranslated regions) may be analyzed using molecular phylogeny.
--Pseudogenes (nonfunctional genes) are studied by molecular phylogeny.
--Rates of transitions and transversions can be measured.
Transitions: purine to purine (A ↔ G) or pyrimidine to pyrimidine (C ↔ T) substitutions.
Transversions: purine ↔ pyrimidine substitutions.
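Transitions and transversions can be counted from a pair of aligned DNA sequences. The sketch below is a simple illustration, not the method used by MEGA; it assumes two equal-length aligned strings, skips any column containing a character other than A, C, G, or T, and uses the hypothetical function name `count_ts_tv`.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1: str, seq2: str) -> tuple[int, int]:
    """Return (transitions, transversions) between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Skip identical sites, gaps, and ambiguous characters.
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions, transversions


ts, tv = count_ts_tv("AGCTAGCT", "GGTTACCA")
print("transitions:", ts, "transversions:", tv)
```

In this toy pair there are two transitions (A/G and C/T) and two transversions (G/C and T/A).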
Stage 1: Use of DNA, RNA, or protein
Page 372
Models of nucleotide substitution
Fig. 11.14, page 379: A ↔ G and C ↔ T are transitions; substitutions between a purine (A, G) and a pyrimidine (C, T) are transversions.
MEGA outputs transition and transversion frequencies
For primate mitochondrial DNA, the ratio of transitions to transversions is particularly high
The fundamental basis of a phylogenetic tree is a multiple sequence alignment.
(If there is a misalignment, or if a nonhomologous sequence is included in the alignment, it will still be possible to generate a tree.)
Consider the following alignment of 13 orthologous retinol-binding proteins.
Stage 2: Multiple sequence alignment
Page 375
Fig. 11.13, page 376
Some positions of the multiple sequence alignment are invariant (arrow 2). Some positions distinguish fish RBP from all other RBPs (arrow 3).
[1] Confirm that all sequences are homologous
[2] Adjust gap creation and extension penalties as needed to optimize the alignment
[3] Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data).
[4] Many experts recommend that you delete any column of an alignment that contains gaps (even if the gap occurs in only one taxon)
In this example, note that four RBPs are from fish, while the others are from vertebrates that evolved more recently.
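Recommendations [3] and [4] above amount to keeping only the alignment columns that have a residue in every taxon. The following is a minimal sketch of that filtering step, assuming the alignment is held as a dictionary of equal-length strings with "-" or "." marking gaps; the function name `drop_gapped_columns` and the toy sequences are hypothetical.

```python
def drop_gapped_columns(alignment: dict[str, str], gap_chars: str = "-.") -> dict[str, str]:
    """Return a copy of the alignment without any column that contains a gap."""
    if not alignment:
        return {}
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    # Keep a column only if no taxon has a gap character there.
    keep = [
        i for i in range(n_columns)
        if all(seq[i] not in gap_chars for seq in alignment.values())
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


toy_alignment = {"taxon_a": "MK-LV", "taxon_b": "MKALV", "taxon_c": "M-ALV"}
print(drop_gapped_columns(toy_alignment))
```

Removing every gapped column is conservative: it discards some signal, but it avoids building a tree from positions whose homology is uncertain.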
Stage 2: Multiple sequence alignment
Page 375
![Page 70: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/70.jpg)
Stage 3: Tree-building models: distance
Page 378
The simplest approach to measuring distances between sequences is to align pairs of sequences, and then to count the number of differences. This count is called the Hamming distance. For an alignment of length N with n sites at which there are differences, the degree of divergence D (the normalized Hamming distance) is:
D = n / N
But observed differences do not equal genetic distance! Genetic distance involves mutations that are not observed directly (see earlier figure).
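The formula D = n / N can be sketched in a few lines of Python. This is an illustration only, assuming two already-aligned sequences of equal length; the function name `p_distance` and the toy sequences are hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites that differ (D = n / N)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # n is the Hamming distance: the count of mismatched positions.
    n = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return n / len(seq_a)


d = p_distance("ACGTACGTAC", "ACGTACGTAA")  # one difference in ten sites
```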
![Page 71: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/71.jpg)
Stage 3: Tree-building models: distance
Page 379
Jukes and Cantor (1969) proposed a corrective formula:
$$D = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$
Here p is the proportion of aligned sites that differ (the normalized Hamming distance).
This model describes the probability that one nucleotide will change into another. It assumes that each residue is equally likely to change into any other (i.e. the rate of transversions equals the rate of transitions). In practice, the transition rate is typically greater than the transversion rate.
![Page 72: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/72.jpg)
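Models of nucleotide substitution
[Figure: the four nucleotides A, G, C, and T. Changes between A and G, or between C and T, are transitions; changes between a purine (A, G) and a pyrimidine (C, T) are transversions.]
Fig. 11.14, Page 379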
![Page 73: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/73.jpg)
Jukes and Cantor one-parameter model of nucleotide substitution (α = β)
[Figure: the four nucleotides A, G, T, and C, with all substitutions occurring at the same rate.]
Fig. 11.14, Page 379
![Page 74: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/74.jpg)
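Kimura model of nucleotide substitution (assumes α ≠ β)
[Figure: the four nucleotides A, G, T, and C, with transitions and transversions occurring at different rates.]
Fig. 11.14, Page 379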
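The slides do not give a distance formula for the Kimura model. As a hedged sketch, the standard Kimura two-parameter distance is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitional and transversional differences. The Python below assumes aligned DNA sequences containing only A, C, G, and T; the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq_a, seq_b):
    """Return (P, Q): fractions of sites with a transition / a transversion."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue
        # A<->G and C<->T stay within one chemical class: transitions.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n


def kimura_2p(p_transitions, q_transversions):
    """Kimura two-parameter distance from observed proportions P and Q."""
    return (-0.5 * math.log(1 - 2 * p_transitions - q_transversions)
            - 0.25 * math.log(1 - 2 * q_transversions))
```

Keeping the counting step separate from the formula makes it easy to see that the correction treats the two classes of change differently, which is the point of assuming unequal rates.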
![Page 75: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/75.jpg)
![Page 76: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/76.jpg)
Stage 3: Tree-building models: distance
Page 379
Jukes and Cantor (1969) proposed a corrective formula:
$$D = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$
Consider an alignment where 3/60 aligned residues differ. The normalized Hamming distance is 3/60 = 0.05. The Jukes-Cantor correction is
$$D = -\frac{3}{4}\ln\left(1 - \frac{4}{3}(0.05)\right) = 0.052$$
When 30/60 aligned residues differ, the Jukes-Cantor correction is more substantial:
$$D = -\frac{3}{4}\ln\left(1 - \frac{4}{3}(0.5)\right) = 0.82$$
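The two worked examples can be checked with a short Python sketch of the Jukes-Cantor correction. The function name `jukes_cantor` is hypothetical; the formula is the one given on this slide, and it is only defined for p below 0.75.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion p of differences."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for the logarithm to exist")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


small = jukes_cantor(3 / 60)   # few observed differences: small correction
large = jukes_cantor(30 / 60)  # many observed differences: large correction
```

Rounded, `small` is 0.052 and `large` is 0.82, matching the values above.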
![Page 77: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/77.jpg)
Use MEGA to display a pairwise distance matrix of 13 globins
![Page 78: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/78.jpg)
![Page 79: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/79.jpg)
![Page 80: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/80.jpg)
![Page 81: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/81.jpg)
![Page 82: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/82.jpg)
2E Page 37
Gamma models account for unequal substitution rates across variable sites
![Page 83: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/83.jpg)
[Figure: gamma distributions of substitution rates for shape parameter values 0.25, 1, and 5.]
![Page 84: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/84.jpg)
Introduction to evolution and phylogeny
Nomenclature of trees
Five stages of molecular phylogeny:
[1] selecting sequences
[2] multiple sequence alignment
[3] models of substitution
[4] tree-building
[5] tree evaluation
Goals of the lecture
![Page 85: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/85.jpg)
We will discuss two tree-building methods: distance-based and character-based.
Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. Examples of distance-based algorithms are UPGMA and neighbor-joining.
Stage 4: Tree-building methods
Page 377
![Page 86: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/86.jpg)
Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. Examples of distance-based algorithms are UPGMA and neighbor-joining.
Character-based methods include maximum parsimony and maximum likelihood. Parsimony analysis involves the search for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa.
Stage 4: Tree-building methods
Page 377
![Page 87: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/87.jpg)
We can introduce distance-based and character-based tree-building methods by referring to a tree of 13 orthologous retinol-binding proteins, and the multiple sequence alignment from which the tree was generated.
Stage 4: Tree-building methods
Page 378
![Page 88: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/88.jpg)
Orthologs: members of a gene (protein) family in various organisms. This tree shows RBP orthologs.
[Figure: tree of 13 RBP orthologs. Taxa: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, mouse, rat, rabbit, cow, pig, horse, human. Scale bar: 10 changes.]
Page 43
![Page 89: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/89.jpg)
[Figure: the same tree with two groups labeled. Fish RBP orthologs: common carp, zebrafish, rainbow trout, teleost. Other vertebrate RBP orthologs: African clawed frog, chicken, mouse, rat, rabbit, cow, pig, horse, human. Scale bar: 10 changes.]
Page 43
![Page 90: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/90.jpg)
Fig. 11.13, Page 376
![Page 91: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/91.jpg)
Fig. 11.13, Page 376
Distance-based tree: calculate the pairwise alignments; if two sequences are related, put them next to each other on the tree
![Page 92: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/92.jpg)
Fig. 11.13, Page 376
Character-based tree: identify positions that best describe how characters (amino acids) are derived from common ancestors
![Page 93: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/93.jpg)
Regardless of whether you use distance- or character-based methods for building a tree, the starting point is a multiple sequence alignment.
ReadSeq is a convenient web-based program that translates multiple sequence alignments into formats compatible with most commonly used phylogeny programs such as PAUP and PHYLIP.
Stage 4: Tree-building methods
Page 378
![Page 94: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/94.jpg)
http://evolution.genetics.washington.edu/phylip/software.html
This site lists 200 phylogeny packages. Perhaps the best-known programs are PAUP (David Swofford and colleagues) and PHYLIP (Joe Felsenstein).
![Page 95: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/95.jpg)
ReadSeq is widely available; try the “tools” menu at the LANL HIV database
![Page 96: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/96.jpg)
[1] distance-based
[2] character-based: maximum parsimony
[3] character- and model-based: maximum likelihood
[4] character- and model-based: Bayesian
Stage 4: Tree-building methods
![Page 97: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/97.jpg)
![Page 98: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/98.jpg)
Stage 4: Tree-building methods: distance
Page 379
Many software packages are available for making phylogenetic trees. We will describe two programs.
[1] MEGA (Molecular Evolutionary Genetics Analysis) by Sudhir Kumar, Koichiro Tamura, and Masatoshi Nei. Download it from http://www.megasoftware.net/
[2] Phylogenetic Analysis Using Parsimony (PAUP), written by David Swofford. See http://paup.csit.fsu.edu/.
We will next use MEGA and PAUP to generate trees by the distance-based method UPGMA.
![Page 99: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/99.jpg)
How to use MEGA to make a tree
[1] Enter a multiple sequence alignment (.meg) file
[2] Under the phylogeny menu, select one of these four methods…
Neighbor-Joining (NJ)
Minimum Evolution (ME)
Maximum Parsimony (MP)
UPGMA
![Page 100: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/100.jpg)
Use of MEGA for a distance-based tree: UPGMA
Click compute to obtain tree
Click green boxes to obtain options
![Page 101: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/101.jpg)
Use of MEGA for a distance-based tree: UPGMA
![Page 102: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/102.jpg)
Use of MEGA for a distance-based tree: UPGMA
A variety of styles are available for tree display
![Page 103: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/103.jpg)
Use of MEGA for a distance-based tree: UPGMA
Flipping branches around a node creates an equivalent topology
![Page 104: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/104.jpg)
Tree-building methods: UPGMA
UPGMA is the unweighted pair group method using arithmetic mean
[Figure: five proteins shown as points numbered 1 to 5.]
Fig. 11.17, Page 382
![Page 105: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/105.jpg)
Tree-building methods: UPGMA
Step 1: compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.
[Figure: five proteins shown as points numbered 1 to 5.]
Fig. 11.17, Page 382
![Page 106: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/106.jpg)
Tree-building methods: UPGMA
Step 2: Find the two proteins with the smallest pairwise distance. Cluster them.
[Figure: points 1 to 5; in the tree, proteins 1 and 2 are joined at node 6.]
Fig. 11.17, Page 382
![Page 107: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/107.jpg)
Tree-building methods: UPGMA
Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them.
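[Figure: points 1 to 5; in the tree, proteins 1 and 2 are joined at node 6, and proteins 4 and 5 are joined at node 7.]
Fig. 11.17, Page 382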
![Page 108: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/108.jpg)
Tree-building methods: UPGMA
Step 4: Keep going. Cluster.
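[Figure: points 1 to 5; in the tree, proteins 1 and 2 are joined at node 6, proteins 4 and 5 are joined at node 7, and node 8 adds protein 3.]
Fig. 11.17, Page 382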
![Page 109: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/109.jpg)
Tree-building methods: UPGMA
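Step 5: Last cluster! This is your tree.
[Figure: points 1 to 5; in the tree, nodes 6, 7, and 8 are as before, and node 9 joins the remaining clusters to complete the tree.]
Fig. 11.17, Page 382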
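The clustering steps above can be sketched in Python. This is a minimal illustration rather than the implementation used by MEGA or PAUP: the function name `upgma`, the data layout (a dict keyed by `frozenset` pairs), and the distance values for the five proteins are all invented for the example.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster by repeatedly merging the closest pair (size-weighted average linkage).

    labels: list of taxon names; dist: {frozenset({a, b}): distance}.
    Returns (tree as nested tuples, list of (cluster_a, cluster_b, node_height)).
    """
    clusters = {name: (name, 1) for name in labels}  # id -> (subtree, size)
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Find the two clusters with the smallest pairwise distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        new = f"({a},{b})"
        merges.append((a, b, d[frozenset((a, b))] / 2))
        # Distance from the new cluster to each other cluster is the
        # arithmetic mean over all member pairs.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = ((tree_a, tree_b), size_a + size_b)
    (tree, _), = clusters.values()
    return tree, merges


# Hypothetical distances for five proteins (not taken from Fig. 11.17).
raw = {("1", "2"): 2, ("4", "5"): 4, ("3", "4"): 6, ("3", "5"): 6,
       ("1", "3"): 10, ("2", "3"): 10, ("1", "4"): 10, ("1", "5"): 10,
       ("2", "4"): 10, ("2", "5"): 10}
dist = {frozenset(pair): float(value) for pair, value in raw.items()}
tree, merges = upgma(["1", "2", "3", "4", "5"], dist)
```

With these invented distances, proteins 1 and 2 merge first, then 4 and 5, then 3 joins the (4, 5) cluster, and the last merge yields the root. Each node height is half the distance between the merged clusters, which reflects the constant molecular clock that UPGMA assumes.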
![Page 110: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/110.jpg)
UPGMA is a simple approach for making trees.
• A UPGMA tree is always rooted.
• An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong.
• While UPGMA is simple, it is less accurate than the neighbor-joining approach (described next).
Distance-based methods: UPGMA trees
Page 383
![Page 111: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/111.jpg)
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa.
Begin by placing all the taxa in a star-like structure.
Making trees using neighbor-joining
Page 383
![Page 112: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/112.jpg)
Fig. 11.18, Page 384
Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.
![Page 113: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/113.jpg)
Fig. 11.18, Page 384
Tree-building methods: Neighbor joining
Define the distance from X to Y by
$$d_{XY} = \frac{1}{2}\left(d_{1Y} + d_{2Y} - d_{12}\right)$$
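One neighbor-joining step can be sketched in Python. The slides describe the goal (minimize the sum of branch lengths) but not a selection rule, so this sketch assumes the widely used Q criterion, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from taxon i to all others. The function name `nj_first_join` and the five-taxon distance matrix are illustrative assumptions, not data from the lecture.

```python
def nj_first_join(names, d):
    """Pick the first pair of neighbors and the branch lengths to their new node."""
    n = len(names)
    if n < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    # r(i): total distance from taxon i to every other taxon.
    totals = {i: sum(d[i][j] for j in names if j != i) for i in names}
    pairs = [(i, j) for k, i in enumerate(names) for j in names[k + 1:]]

    def q(pair):
        i, j = pair
        return (n - 2) * d[i][j] - totals[i] - totals[j]

    i, j = min(pairs, key=q)  # the pair with the lowest Q is joined
    len_i = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    return (i, j), len_i, len_j


# Toy symmetric distance matrix for taxa a to e (invented for illustration).
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
names = ["a", "b", "c", "d", "e"]
d = {x: {} for x in names}
for (x, y), value in raw.items():
    d[x][y] = d[y][x] = float(value)

pair, len_a, len_b = nj_first_join(names, d)
```

In this toy matrix the smallest raw distance is between d and e, yet the criterion joins a and b first. That difference from simply clustering the closest pair is why neighbor joining does not require a constant molecular clock.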
![Page 114: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/114.jpg)
Use of MEGA for a distance-based tree: NJ
Neighbor joining produces a tree reasonably similar to the UPGMA tree
![Page 115: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/115.jpg)
Fig. 11.19, Page 385
Example of a neighbor-joining tree: phylogenetic analysis of 13 RBPs
![Page 116: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/116.jpg)
We will discuss four tree-building methods:
[1] distance-based
[2] character-based: maximum parsimony
[3] character- and model-based: maximum likelihood
[4] character- and model-based: Bayesian
Stage 4: Tree-building methods
![Page 117: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/117.jpg)
Tree-building methods: character based
Rather than pairwise distances between proteins, evaluate the aligned columns of amino acid residues (characters).
Tree-building methods based on characters include maximum parsimony and maximum likelihood.
Page 383
![Page 118: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/118.jpg)
The main idea of maximum parsimony, a character-based method, is to find the tree with the shortest branch lengths possible. Thus we seek the most parsimonious (“simple”) tree.
• Identify informative sites. For example, constant characters are not parsimony-informative.
• Construct trees, counting the number of changes required to create each tree. For about 12 taxa or fewer, evaluate all possible trees exhaustively; for >12 taxa perform a heuristic search.
• Select the shortest tree (or trees).
Making trees using character-based methods
Page 383
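The first bullet above can be sketched in Python. The slide only says that constant characters are not informative; this sketch assumes the usual fuller definition, in which a site is parsimony-informative when at least two different character states each occur in at least two taxa. The function name `informative_sites` is hypothetical.

```python
from collections import Counter


def informative_sites(seqs):
    """Return the 0-based indices of parsimony-informative alignment columns."""
    sites = []
    for index, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        # Count the states that are shared by two or more taxa.
        shared_states = sum(1 for c in counts.values() if c >= 2)
        if shared_states >= 2:
            sites.append(index)
    return sites


toy = ["AAGA", "AAGC", "AGAA", "AGAT"]  # invented four-taxon alignment
found = informative_sites(toy)
```

In the toy alignment the first column is constant and the last column has only one shared state, so only the two middle columns are reported.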
![Page 119: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/119.jpg)
As an example of tree-building using maximum parsimony, consider these four taxa:
AAG, AAA, GGA, AGA
How might they have evolved from a common ancestor such as AAA?
Fig. 11.20Page 385
![Page 120: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/120.jpg)
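Tree-building methods: Maximum parsimony
[Figure: three candidate trees for the taxa AAG, AAA, GGA, and AGA, each descending from the ancestor AAA, with the number of changes marked on each branch.
Tree 1 pairs (AAG, AAA) and (GGA, AGA): cost = 3.
Tree 2 pairs (AAG, AGA) and (AAA, GGA): cost = 4.
Tree 3 pairs (AAG, GGA) and (AAA, AGA): cost = 4.]
Fig. 11.20, Page 385
In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).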
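The three costs can be reproduced with a small Python sketch. It assumes Fitch's counting algorithm, a standard way to score a fixed tree under parsimony that the slides do not name; the function names and the nested-tuple tree encoding are hypothetical.

```python
def fitch_site(node, site):
    """Return (possible ancestral states, changes needed) for one alignment column."""
    if isinstance(node, str):  # a leaf is just its sequence
        return {node[site]}, 0
    left_states, left_cost = fitch_site(node[0], site)
    right_states, right_cost = fitch_site(node[1], site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is required at this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_cost(tree, n_sites):
    """Total number of changes over all columns for a binary tree of sequences."""
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))


trees = [
    (("AAG", "AAA"), ("GGA", "AGA")),
    (("AAG", "AGA"), ("AAA", "GGA")),
    (("AAG", "GGA"), ("AAA", "AGA")),
]
costs = [parsimony_cost(t, 3) for t in trees]
```

The first tree needs three changes and the other two need four, so the first is the most parsimonious of the three.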
![Page 121: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/121.jpg)
MEGA for maximum parsimony (MP) trees
Options include heuristic approaches and bootstrapping
![Page 122: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/122.jpg)
MEGA for maximum parsimony (MP) trees
In maximum parsimony, there may be more than one tree having the lowest total branch length. You may compute the consensus best tree.
![Page 123: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/123.jpg)
Fig. 11.22, Page 387
Phylogram
(values are proportional to branch lengths)
![Page 124: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/124.jpg)
Fig. 11.22, Page 387
Rectangular phylogram
(values are proportional to branch lengths)
![Page 125: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/125.jpg)
Fig. 11.22, Page 387
Cladogram
(values are not proportional to branch lengths)
![Page 126: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/126.jpg)
Fig. 11.22, Page 387
Rectangular cladogram
(values are not proportional to branch lengths)
These four trees display the same data in different formats.
![Page 127: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/127.jpg)
We will discuss four tree-building methods:
[1] distance-based
[2] character-based: maximum parsimony
[3] character- and model-based: maximum likelihood
[4] character- and model-based: Bayesian
Stage 4: Tree-building methods
![Page 128: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/128.jpg)
Maximum likelihood is an alternative to maximum parsimony. It is computationally intensive. A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process.
What are the tree topology and branch lengths that have the greatest likelihood of producing the observed data set?
ML is implemented in the TREE-PUZZLE program, as well as PAUP and PHYLIP.
Making trees using maximum likelihood
Page 386
![Page 129: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/129.jpg)
(1) Reconstruct all possible quartets A, B, C, D. For 12 myoglobins there are 495 possible quartets.
(2) Puzzling step: begin with one quartet tree. N-4 sequences remain. Add them to the branches systematically, estimating the support for each internal branch. Report a consensus tree.
Maximum likelihood: Tree-Puzzle
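The quartet count in step (1) is the number of ways to choose 4 sequences from 12. A short Python sketch, using placeholder sequence names, confirms the figure and enumerates the three possible unrooted topologies for one quartet; the helper name `quartet_topologies` is hypothetical.

```python
from itertools import combinations
from math import comb

n_sequences = 12
n_quartets = comb(n_sequences, 4)  # choose 4 of 12, order ignored

# Placeholder names standing in for the 12 myoglobin sequences.
names = [f"seq{i}" for i in range(1, n_sequences + 1)]
quartets = list(combinations(names, 4))


def quartet_topologies(a, b, c, d):
    """The three unrooted trees for four taxa, written as two pairs of neighbors."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


first = quartet_topologies(*quartets[0])
```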
![Page 130: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/130.jpg)
Maximum likelihood tree
![Page 131: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/131.jpg)
Quartet puzzling
![Page 132: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/132.jpg)
We will discuss four tree-building methods:
[1] distance-based
[2] character-based: maximum parsimony
[3] character- and model-based: maximum likelihood
[4] character- and model-based: Bayesian
Stage 4: Tree-building methods
![Page 133: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/133.jpg)
Bayesian inference of phylogeny with MrBayes
Calculate:
$$\Pr[\text{Tree} \mid \text{Data}] = \frac{\Pr[\text{Data} \mid \text{Tree}] \times \Pr[\text{Tree}]}{\Pr[\text{Data}]}$$
Pr [ Tree | Data ] is the posterior probability distribution of trees. Ideally this involves a summation over all possible trees. In practice, Markov chain Monte Carlo (MCMC) methods are run to estimate the posterior probability distribution.
Notably, Bayesian approaches require you to specify prior assumptions about the model of evolution.
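Bayes' rule for trees can be illustrated with a toy Python calculation over three candidate topologies. The likelihood values and the uniform prior below are invented for the example, and the function name `posterior` is hypothetical. This is not how MrBayes works internally: real analyses cannot enumerate all trees, which is why MCMC sampling is used.

```python
def posterior(likelihoods, priors):
    """Apply Bayes' rule: posterior is likelihood times prior, normalized."""
    joint = {tree: likelihoods[tree] * priors[tree] for tree in likelihoods}
    evidence = sum(joint.values())  # Pr[Data], summed over all candidate trees
    return {tree: value / evidence for tree, value in joint.items()}


# Hypothetical Pr[Data | Tree] for the three unrooted four-taxon topologies.
likelihoods = {"((A,B),(C,D))": 0.006, "((A,C),(B,D))": 0.003, "((A,D),(B,C))": 0.001}
priors = {tree: 1 / 3 for tree in likelihoods}  # uniform prior over topologies
post = posterior(likelihoods, priors)
```

With a uniform prior the posterior simply rescales the likelihoods so that they sum to one; a non-uniform prior would shift the result.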
![Page 134: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/134.jpg)
Introduction to evolution and phylogeny
Nomenclature of trees
Five stages of molecular phylogeny:
[1] selecting sequences
[2] multiple sequence alignment
[3] models of substitution
[4] tree-building
[5] tree evaluation
Goals of the lecture
![Page 135: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/135.jpg)
The main criteria by which the accuracy of a phylogenetic tree is assessed are consistency, efficiency, and robustness. Evaluation of accuracy can refer to an approach (e.g. UPGMA) or to a particular tree.
Stage 5: Evaluating trees
Page 386
![Page 136: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/136.jpg)
![Page 137: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/137.jpg)
Bootstrapping is a commonly used approach to measuring the robustness of a tree topology. Given a branching order, how consistently does an algorithm find that branching order in a randomly resampled version of the original data set?
To bootstrap, make an artificial dataset by randomly sampling columns, with replacement, from your multiple sequence alignment. Make the dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates. Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. >70% is considered significant.
Stage 5: Evaluating trees: bootstrapping
Page 388
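The resampling step can be sketched in Python. This is a minimal illustration, assuming the alignment is a dict of equal-length strings keyed by taxon name; the function name `bootstrap_alignment` and the toy sequences are hypothetical, and the tree-building and clade-counting steps are left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by sampling columns with replacement."""
    n_cols = len(next(iter(alignment.values())))
    # Draw as many column indices as the original alignment has columns.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


toy = {"taxon1": "ACGTAC", "taxon2": "ACGTTC", "taxon3": "TCGAAC"}  # invented data
rng = random.Random(0)  # seeded only so the example is repeatable
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
```

Each replicate keeps whole columns intact, so the relationship between taxa at a site is preserved while the set of sites varies. Building a tree from each replicate and counting how often a clade reappears gives its bootstrap value.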
![Page 138: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/138.jpg)
MEGA for maximum parsimony (MP) trees
Bootstrap values show the percent of times each clade is supported after a large number (n=500) of replicate samplings of the data.
![Page 139: Molecular Phylogeny and Evolution December 15, 2008 Bioinformatics J. Pevsner pevsner@kennedykrieger.org](https://reader035.vdocuments.net/reader035/viewer/2022062422/56649eb45503460f94bbb92e/html5/thumbnails/139.jpg)
In 61% of the bootstrap resamplings, ssrbp and btrbp (pig and cow RBP) formed a distinct clade. In 39% of the cases, another protein joined the clade (e.g. ecrbp), or one of these two sequences joined another clade.
Fig. 11.24, Page 388