From Basic Concepts to Advanced Applications
Molecular Evolution & Phylogeny
By Ofir Cohen
The Bioinformatics Unit, G.S. Wise Faculty of Life Sciences
Tel Aviv University, Israel
May 2012
http://ibis.tau.ac.il/intro_bioinfo/phylogenyWorkshop/
2 of 50
The Human Genome Project ("behind the scenes")
Venter et al., Science 291:1304-1351 (2001)
International Human Genome Sequencing Consortium, Nature 409:860-921 (2001)
The club resident J.D. Watson: Back2back with DJ. Venter -
3 of 50
Genome Sequencing – Ongoing Revolution
The race is (still) on… The promise is huge…
4 of 50
SRA (Sequence Read Archive) = raw sequences from next-generation machines
Trace = raw sequences from 1990s machines
5 of 50
Darwin’s teachings – Tree-like evolution
Introduction – The tree concept
6 of 50
Darwin’s teachings – Common descent
7 of 50
Common Descent – Modern evidence
"The unity of life is no less remarkable than its diversity" (Theodosius Dobzhansky)
8 of 50
Mathematicians developed tools to analyze trees
A connected graph without cycles is a tree.
[Figure, adapted from Huson et al. 2008: a tree; a graph with a cycle (not a tree); a rooted binary tree]
Part of the wider field of graph theory
Bridges of Königsberg
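The graph-theory definition above (a tree is a connected graph without cycles) can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function name `is_tree` and the example graphs are illustrative assumptions.

```python
from collections import deque

def is_tree(nodes, edges):
    """Return True if the undirected graph is connected and has no cycles."""
    nodes = list(nodes)
    if not nodes:
        return False
    # A tree on n nodes has exactly n - 1 edges.
    if len(edges) != len(nodes) - 1:
        return False
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search from any node must reach every node.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)

# Hypothetical examples: a star-shaped tree and a triangle (a cycle).
star = [("x", "A"), ("x", "B"), ("x", "C")]
triangle = [("A", "B"), ("B", "C"), ("C", "A")]
print(is_tree(["x", "A", "B", "C"], star))   # True
print(is_tree(["A", "B", "C"], triangle))    # False
```

With n - 1 edges, being connected already rules out cycles, so one edge count plus one traversal is enough.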
9 of 50
What is a Phylogenetic Tree?
Phylogenetic tree: a (hypothetical) historical pattern of evolutionary relationships among organisms
(Greek: phylon = race and genetic = birth)
[Figure: rooted tree of Homo, Bos, Mus, Rattus and Gallus with the root, a node, a leaf and a branch labeled; the branch lengths shown range from 0.01 to 0.066]
Horizontal branch length is proportional to evolutionary distance (unit = substitutions per site)
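Because branch lengths are measured in substitutions per site, the evolutionary distance between two species is the sum of the branch lengths on the path connecting them. Below is a minimal sketch of that sum; the topology, the internal node names (`n1`, `n2`, `n3`) and all branch lengths are hypothetical and are not read from the slide's figure.

```python
# Hypothetical rooted tree: child -> (parent, branch length in substitutions/site).
# Topology and lengths are illustrative only, not taken from the slide.
PARENT = {
    "Mus": ("n1", 0.01),
    "Rattus": ("n1", 0.02),
    "n1": ("n2", 0.05),
    "Homo": ("n2", 0.03),
    "n2": ("n3", 0.01),
    "Bos": ("n3", 0.04),
    "n3": ("root", 0.02),
    "Gallus": ("root", 0.07),
}

def path_to_root(node):
    """Map `node` and each of its ancestors to their distance from `node`."""
    distances = {node: 0.0}
    total = 0.0
    while node in PARENT:
        parent, length = PARENT[node]
        total += length
        distances[parent] = total
        node = parent
    return distances

def patristic_distance(a, b):
    """Sum of branch lengths on the path between two nodes."""
    from_a = path_to_root(a)
    from_b = path_to_root(b)
    # The closest shared ancestor gives the smallest summed distance.
    return min(from_a[n] + from_b[n] for n in from_a if n in from_b)

print(round(patristic_distance("Mus", "Rattus"), 3))   # 0.03
print(round(patristic_distance("Homo", "Gallus"), 3))  # 0.13
```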
10 of 50
Molecular evidence of HIV transmission in a criminal case
Introduction - Anecdotes
Metzker, Michael L. et al. (2002) Proc. Natl. Acad. Sci. USA 99, 14292-14297
11 of 50
Criminal investigation
In August 1994, a nurse tests negative for HIV and breaks off a messy 10-year affair with a doctor. Three weeks later, the doctor gives his ex-mistress a vitamin B-12 shot.
In January 1995, the nurse tests positive for both HIV and hepatitis C.
The doctor’s office records from that day are missing (but are eventually found). The doctor had drawn blood samples from a known HIV patient and a known hepatitis C patient on the same day as the vitamin B-12 shot. The nurse had never had contact with either patient.
This is circumstantial evidence that the doctor injected blood from one of his patients into his ex-girlfriend.
How can this be proved using a phylogenetic approach?
12 of 50
HIV – short background
Extreme heterogeneity: within each patient there are many different viral strains ("quasi-species")
13 of 50
History of the virus:
gp120 (gene tree)
[Figure: gp120 gene tree with groups labeled PATIENT, VICTIM and CONTROLS. ©2002 National Academy of Sciences, U.S.A.]
14 of 50
History of the virus:
RT (gene tree)
[Figure: RT gene tree with groups labeled VICTIM and PATIENT]
Source sequences that are paraphyletic (other sequences are nested within them) with respect to the recipient sequences provide evidence for the direction of transmission.
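The nesting argument can be stated as a test on a rooted tree: the recipient sequences form a clade that sits inside the clade of source plus recipient, while the source sequences alone do not form a clade. The sketch below uses a hypothetical toy tree written as nested tuples; the sequence names (`P1`, `P2`, `V1`, `V2`, `C1`) and the function names are assumptions for illustration and do not come from Metzker et al.

```python
def collect_clades(tree, clades):
    """Return the leaf set of `tree` and record the leaf set of every subtree."""
    if isinstance(tree, str):          # a leaf
        leaves = frozenset([tree])
    else:                              # an internal node with child subtrees
        leaves = frozenset()
        for child in tree:
            leaves |= collect_clades(child, clades)
    clades.add(leaves)
    return leaves

def is_monophyletic(tree, group):
    """True if `group` is exactly the leaf set of some subtree."""
    clades = set()
    collect_clades(tree, clades)
    return frozenset(group) in clades

def is_paraphyletic_wrt(tree, source, recipient):
    """True if `recipient` is a clade nested within the source sequences."""
    both = set(source) | set(recipient)
    return (is_monophyletic(tree, recipient)
            and is_monophyletic(tree, both)
            and not is_monophyletic(tree, source))

# Hypothetical tree: victim sequences V1, V2 nested among patient sequences P1, P2.
TOY_TREE = (("P1", ("P2", ("V1", "V2"))), "C1")
print(is_paraphyletic_wrt(TOY_TREE, ["P1", "P2"], ["V1", "V2"]))  # True
print(is_paraphyletic_wrt(TOY_TREE, ["V1", "V2"], ["P1", "P2"]))  # False
```

Swapping the roles of source and recipient gives False, which is what makes the pattern informative about direction.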
15 of 50
Ernst Haeckel's Monophyletic tree of organisms, 1866
Reconstructing the tree of life
16 of 50
Organisms classified into 2 domains:
Eukaryotes, including {plants, animals, protists, fungi}
Prokaryotes = Bacteria
Whittaker, 1969
17 of 50
Reconstructing the Tree of Life
Carl Woese, 1977: phylogenetic taxonomy based on 16S ribosomal RNA
Critiques: Woese unbalanced the tree of life (too much representation for microbial species)
18 of 50
Phylogenetic analysis: Not only among organisms - Cancer phylogeny
A phylogeny of acute myeloid leukemia (AML) subtypes
Riester et al. 2010; Liu et al. 2009
19 of 50
Phylogenetic analysis: Not only in biology – Language evolution
Gray and Atkinson, 2003
Researchers study the evolution of languages by treating them like genomes.
Instead of COGs (gene families), they analyze cognates (word families).
20 of 50
Reading Trees: Which tree is more accurate?
Reading Trees
Haeckel’s pedigree of man
Human "on top" – wrong!
21 of 50
Rooted vs. Un-rooted trees
[Figure: two trees of human, mouse, fugu and Drosophila, one un-rooted and one rooted, with labels: root, edge, internal node (ancestor), leaf, and a time axis]
22 of 50
How do we root a tree?
Gorilla gorilla (gorilla)
Homo sapiens (human)
Pan troglodytes (chimpanzee)
Gallus gallus (chicken)
23 of 50
Rooting based on a priori knowledge: Using Outgroup
[Figure: a tree of human, chimp, gorilla and chicken, shown un-rooted and rooted using an outgroup]
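Outgroup rooting can be sketched as follows: take the un-rooted tree, find the branch that joins the outgroup to everything else, and place the root on that branch. This is a minimal illustration under stated assumptions: the adjacency-list encoding, the internal node names `x` and `y`, and the choice of chicken as the outgroup with human and chimp as neighbours are illustrative, and branch lengths are ignored.

```python
def subtree(adjacency, node, came_from):
    """Nested-tuple subtree hanging below `node`, seen from `came_from`."""
    children = [n for n in adjacency[node] if n != came_from]
    if not children:
        return node  # a leaf
    return tuple(subtree(adjacency, child, node) for child in children)

def root_with_outgroup(adjacency, outgroup):
    """Place the root on the branch joining the outgroup leaf to the rest."""
    (neighbour,) = adjacency[outgroup]   # a leaf has exactly one neighbour
    return (subtree(adjacency, neighbour, outgroup), outgroup)

# Hypothetical un-rooted tree; "x" and "y" are unnamed internal nodes.
UNROOTED = {
    "Human": ["x"], "Chimp": ["x"],
    "Gorilla": ["y"], "Chicken": ["y"],
    "x": ["Human", "Chimp", "y"],
    "y": ["x", "Gorilla", "Chicken"],
}
print(root_with_outgroup(UNROOTED, "Chicken"))
# ((('Human', 'Chimp'), 'Gorilla'), 'Chicken')
```

The un-rooted topology is unchanged; only the direction of time is added by the a priori choice of outgroup.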
24 of 50
Comparative Genomics – "All life is one"
Compare homologous sequences – Multiple Sequence Alignments
25 of 50
Orthologs
[Figure: a speciation event splits an ancestor into descendant 1 (e.g., human) and descendant 2 (e.g., dog)]
Orthologs typically retain the same or a similar function in the course of evolution.
26 of 50
Paralogs
Duplication
Evolutionary innovation: without the original selective pressure on one copy, that copy is free to mutate and acquire new functions.
27 of 50
Alignment and phylogeny are mutually dependent
[Figure: unaligned sequences go through sequence alignment to give an MSA, which goes through phylogeny reconstruction to give a tree (scale label 0.4); caption: inaccurate tree building]
28 of 50
Part II: Tools
29 of 50
Multiple sequence alignment (MSA)
Several advanced MSA programs are available. Today we will use two:
MAFFT – fast and relatively accurate
PRANK – distinct from all other MSA programs because of its correct treatment of insertions/deletions
Tools - Alignments
30 of 50
MAFFT
Web server (& download option): http://mafft.cbrc.jp/alignment/server/index.html
Efficiency-tuned variants: quick & dirty or slow but accurate
Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 2002, Vol. 30, No. 14, 3059-3066
31 of 50
Choosing a MAFFT strategy
[Figure: MAFFT strategies arranged from quick & dirty to slow but accurate]
32 of 50
MAFFT output
Saving the output:
Choose a format: Clustal, Fasta, or click "Reformat" to convert to a selection of other formats
Save page as a text file
A colored view of the alignment
33 of 50
PRANK
34 of 50
Classical alignment errors for HIV env
[Figure: HIV env alignments compared: CLUSTALW vs. PRANK]
35 of 50
PRANK Web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/
36 of 50
PRANK output
If you need a different format – copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/
38 of 50
1. Download the sequence files from the website: http://ibis.tau.ac.il/intro_bioinfo/phylogenyWorkshop/
Open "fahA.fas" in Notepad or a browser: these are 65 protein sequences in FASTA format.
2. Run the PRANK web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/
(1)
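Before uploading, a FASTA file such as the one in step 1 can be checked locally by counting its records. The following is a minimal hand-written parser, given as a sketch only; the function name `read_fasta` and the two example records are hypothetical and are not the contents of "fahA.fas".

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" line starts a new record; the rest of the line is its name.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[header] += line
    return records

# Hypothetical two-record example (not taken from fahA.fas).
example = """>seq1 hypothetical protein
MKTAYIAK
QRQISFVK
>seq2 hypothetical protein
MKTAYLAK
""".splitlines()
records = read_fasta(example)
print(len(records))  # 2
```

The same function accepts an open file handle, for example `read_fasta(open("fahA.fas"))`; if the file matches the description above, the count should be 65.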
39 of 50
Trees Reconstruction Methods
40 of 50
Phylogeny reconstruction
Different approaches (algorithms / programs):
Distance based methods (e.g. neighbor-joining, as in ClustalW): fast but inaccurate
Maximum parsimony (e.g. MEGA)
Maximum likelihood methods (e.g. PhyML, RAxML): accurate but slower
Bayesian methods (e.g. MrBayes): most accurate but very slow
[Figure: sequences A-E with an MSA, a pairwise distance table and a guide tree]
Tools - Trees
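Distance based methods start from a pairwise distance table computed from the alignment. A minimal sketch of such a table is shown here, using the simplest measure (the fraction of differing sites, often called the p-distance); real programs usually apply a substitution model on top of this. The function names and the four short sequences are hypothetical.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Fraction of differing sites, skipping columns with a gap in either sequence."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else float("nan")

def distance_table(alignment):
    """Pairwise distances for every pair of aligned sequences."""
    return {(x, y): p_distance(alignment[x], alignment[y])
            for x, y in combinations(alignment, 2)}

# Hypothetical aligned sequences of equal length.
ALIGNMENT = {"A": "ATCTG", "B": "ATCTG", "C": "ACTTA", "D": "ACCTA"}
table = distance_table(ALIGNMENT)
for pair, distance in table.items():
    print(pair, distance)
```

A method such as neighbor-joining would then build the tree from this table alone, which is why these methods are fast: the sequences are not consulted again.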
41 of 50
PhyML
The most widely used maximum likelihood (ML) program
Web server (& download): http://www.atgc-montpellier.fr/phyml/
44 of 50
RAxML
Web server: http://phylobench.vital-it.ch/raxml-bb/
Similar maximum likelihood (ML) methodology to PhyML, but much faster
Faster results with bootstrap
45 of 50
Bootstrapping
Now we have a tree, but what is the reliability of this tree?
46 of 50
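Bootstrap
A. Generate pseudo-data sets by sampling N positions (alignment columns) with replacement. Do not change the number of sequences. Resample 100-1000 times.
Original alignment (positions 1 to 100):
1 : ATCTG…A
2 : ATCTG…C
3 : ACTTA…C
4 : ACCTA…T
Pseudo-data set 1 (sampled positions 1, 1, 2, 4, 4, …):
1 : AATTT…T
2 : AATTT…G
3 : AACTT…T
4 : AACTT…T
Pseudo-data set 2 (sampled positions 4, 7, 7, 8, 9, …):
1 : TTTAT…T
2 : TAACC…G
3 : TAACC…T
4 : TGGGA…T
Pseudo-data set 3 (sampled positions 1, 5, 5, 7, 8, …):
1 : AGGTA…T
2 : AGGAC…G
3 : AAAAC…A
4 : AAAGG…C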
47 of 50
Bootstrap
B. Reconstruct a tree from each data set.
[Figure: the three pseudo-data sets, each with its reconstructed tree of Sp1, Sp2, Sp3 and Sp4]
48 of 50
Bootstrap
C. Compute the majority-rule consensus.
[Figure: three bootstrap trees of Sp1, Sp2, Sp3 and Sp4 and their consensus tree, with support values of 67% and 100%]
In 67% of the data sets, the split between Sp1+Sp2 and the rest of the tree was found.
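Steps A to C can be sketched end to end for four sequences. This is a toy illustration under loud assumptions: the tree-building step is replaced by a stand-in (`best_split`) that simply picks the two-plus-two grouping with the fewest within-pair differences, which is not neighbor-joining, maximum likelihood or any method used by PhyML or RAxML. The sequences and all function names are hypothetical.

```python
import random
from collections import Counter

def resample_columns(alignment, rng):
    """Step A: draw N columns with replacement; the sequences stay the same."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}

def mismatches(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def best_split(alignment):
    """Step B stand-in (toy, exactly four sequences): choose the 2+2 split
    with the fewest within-pair differences. Ties go to the first candidate."""
    a, b, c, d = sorted(alignment)
    candidates = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def cost(split):
        (p, q), (r, s) = split
        return (mismatches(alignment[p], alignment[q])
                + mismatches(alignment[r], alignment[s]))

    return min(candidates, key=cost)

def bootstrap_support(alignment, replicates=100, seed=0):
    """Step C: percentage of pseudo-data sets in which each split was found."""
    rng = random.Random(seed)
    counts = Counter(best_split(resample_columns(alignment, rng))
                     for _ in range(replicates))
    return {split: 100.0 * n / replicates for split, n in counts.items()}

# Hypothetical four-sequence alignment.
ALIGNMENT = {"Sp1": "ATCTGA", "Sp2": "ATCTGC", "Sp3": "ACTTAC", "Sp4": "ACCTAT"}
support = bootstrap_support(ALIGNMENT)
for split, percent in support.items():
    print(split, percent)
```

The percentages play the role of the 67% and 100% labels on the consensus tree: a split that keeps reappearing when columns are resampled is supported by many sites rather than by a few.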
49 of 50
1. Give "fahA.prank.phylip" or "fahA.mafft.phylip" as input to the RAxML webserver (don't forget to tick "Protein sequences" and “Maximum likelihood search” and enter your email)
(3)
50 of 50
FigTree: tree visualization and figure creation
http://tree.bio.ed.ac.uk/software/figtree/
Manipulate a node
Manipulate a clade
Manipulate a taxon
51 of 50
1. In case the trees are not ready yet, download the tree from the website
2. Open "fahA.prank.phylip_phyml_tree.txt" in FigTree http://tree.bio.ed.ac.uk/software/figtree/
3. Play around with the different options and make a pretty figure!
   a. Find out how to color specific clades, as below
   b. Try each of the three options under "Layout"
4. Export a figure in PDF format (File > Export Graphic…)
(4)
52 of 50
Final Questions…