Lecture 5: Phylogenies. 9/16/09. Translated BLAST = protein vs. translated database
Post on 19-Dec-2015
Translated BLAST (tblastn) = protein query vs. translated nucleotide database
Blasting GenBank - blastn
Z. bruijni - long-beaked echidna
T. aculeatus - echidna
T. rostratus - honey possum
AX8GS9DG01S
Blasting GenBank - discontiguous megablast - exactly the same as blastn
Z. bruijni - long-beaked echidna
T. aculeatus - echidna
T. rostratus - honey possum
AX9N23U7014
Blasting GenBank - megablast - same species but different order
Z. bruijni - long-beaked echidna
T. aculeatus - echidna
T. rostratus - honey possum
AX9TUM1G016
Blasting GenBank - tblastn
AX9DYYTE01N
T. aculeatus - echidna
S. brachyurus - quokka
S. crassicaudata - fat-tailed dunnart
M. fasciatus - numbat
I. obesulus - quenda
Species found by BLAST
I. obesulus = quenda = bandicoot
T. aculeatus = echidna
M. fasciatus = numbat
T. rostratus = honey possum
S. crassicaudata = fat-tailed dunnart
O. anatinus = platypus
S. brachyurus = quokka
Z. bruijni = long-beaked echidna
HomoloGene - can be reached from the NCBI home page
Scroll down - they are listed alphabetically
Questions
Phylogenies - what are they?
1. How do we build them?
2. What do they tell us?
Phylogeny
Evolutionary history of a group of organisms, especially as depicted in a family tree
Haeckel, 1879
Things trees might tell you:
How are organisms with particular trait related?
Did trait evolve multiple times or only once?
What is the evolutionary pathway?
Of organisms
Of genes
Molecules can be used to learn how organisms are related
To learn about vertebrate evolution: Compare >600 genes
1998
Used genes to measure time
1) Time since common ancestor with human
2) Time since two groups diverged
More recent version of vertebrate evolution which shows divergence times on the animal tree
Ponting 2008
Orangutan, Human, Chimp, Rhesus monkey
Mouse, Rat
Dog, Cat, Horse, Cow, Opossum
Wallaby
Anole
Chicken
Frog
Fish - Medaka, Fugu, Tetraodon, Zebrafish
Elephant shark
Lamprey
Platypus
Primates 25 MY
Mammals 100 MY
All vertebrates 550 MY
Tetrapods 420 MY
Fish 320 MY
Molecular clock
Molecules change at a steady rate
We can calibrate how fast they change using fossils
The molecules then become a timepiece to measure how recently different groups split off from each other
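The calibration idea can be sketched in a few lines of Python. This is a minimal illustration that assumes a strictly constant rate and a simple linear relation between distance and time (distance = 2 x rate x time, because both lineages accumulate changes after the split). The distance values are hypothetical and were not taken from the lecture; only the 100 MY calibration point echoes the mammal divergence time shown earlier.

```python
def substitution_rate(distance, calibration_time_my):
    """Changes per site per million years along one lineage.

    Both lineages evolve independently after the split, so the
    distance between them accumulates over 2 * calibration_time_my.
    """
    return distance / (2.0 * calibration_time_my)


def divergence_time(distance, rate):
    """Estimated time (MY) since two sequences shared an ancestor."""
    return distance / (2.0 * rate)


# Hypothetical calibration: fossils date a split at 100 MY and the
# two sequences differ at 20% of sites.
rate = substitution_rate(0.20, 100.0)

# Hypothetical query: another pair of sequences differs at 5% of sites.
estimate = divergence_time(0.05, rate)
print(f"rate = {rate:.4f} changes/site/MY, split about {estimate:.0f} MY ago")
```

The linear relation only holds for small distances; larger distances need a correction for multiple changes at the same site.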
Sequence conservation may be high
Gene might code for a protein which is highly constrained
Might have to interact with lots of other proteins
Selection might be quite strong
Sequence conservation may be low
Not much constraint
Few sites of interaction
Selection might be weak
Phylogeny steps
Align sequences so homologous AA can be compared
Determine the similarity between sequences
Use this to generate a relationship between sequences
ClustalW2 to align sequences
Put sequences in FASTA file
>TetraodonG1
MVWDGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYPQYYLVDPIMFKMLALYMFFLICTGTPINGLTLLVTAQNKKLRQPLNYILVNLAVAGLIMCAFGFTITITSAINGYFILGATACAVEGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFTGTHAAVGVLFTWIMAFACAGPPLFGWSRYLPEGMQCSCGPDYYTLAPGYNNESYVIYMFVVHFFVPVFLIFFTYGSLVLTVRAAAQQQESESTQKAQREVTRMCILMVLGFLVAWTPYATFSGWIFMNKGAAFHPLTAALCAFFAKSSALYNPVIYVLMNKQFRNCMLSTFGMGGAVDDETSVSASKTEVSSVS
>ZebrafishG1
MNGTEGSNFYIPMSNRTGLVRSPYDYTQYYLAEPWKFKALAFYMFLLIIFGFPINVLTLVVTAQHKKLRQPLNYILVNLAFAGTIMVIFGFTVSFYCSLVGYMALGPLGCVMEGFFATLGGQVALWSLVVLAIERYIVVCKPMGSFKFSANHAMAGIAFTWFMACSCAVPPLFGWSRYLPEGMQTSCGPDYYTLNPEYNNESYVMYMFSCHFCIPVTTIFFTYGSLVCTVKAAAAQQQESESTQKAEREVTRMVILMVLGFLFAWVPYASFAAWIFFNRGAAFSAQAMAVPAFFSKTSAVFNPIIYVLLNKQFRSCMLNTLFCGKSPLGDDESSSVSTSKTEVSSVSPA
>CichlidG1
MAWEGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYTQYYLADPIFFKLLAFYMFFLICTGTPINSLTLFVTAQNKKLRQPLNYILVNLAVAGLIMCCFGFTITITSAFNGYFILGSTFCAIEGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFSGAHAGAGVLFTWIMAMACAAPPLFGWSRYIPEGMQCSCGPDYYTLAPGFNNESYVIYMFVVHFFVPVFIIFFTYGSLVMTVKAAAAQQQDSASTQKAEKEVTRMCVLMVMGFLIAWTPYASFAGWIFMNKGASFSALTAAIPAFFAKSSALYNPVIYVLMNKQFRNCMLSTIGMGGMVEDETSVSTSKTEVSSVS
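For readers who want to check a FASTA file before submitting it, here is a minimal parsing sketch in Python. It assumes each record has a header line starting with ">" on its own line, followed by one or more sequence lines. The function name `parse_fasta` is illustrative, and the example text uses only the first residues of two of the sequences above to keep it short.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


# Shortened example records (first residues only).
example = """>TetraodonG1
MVWDGGIEPNGTEGKNFYIPMSNRTG
>ZebrafishG1
MNGTEGSNFYIPMSNRTG
"""

seqs = parse_fasta(example)
for name, seq in seqs.items():
    print(name, len(seq), seq[:10])
```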
Aligned sequences .aln ; Jalview gives colored version
Funky tree .dnd (need special program to draw)
Scroll down this page for tree (use Phylogram)
CLUSTAL W (1.83) multiple sequence alignment

TetraodonG1     MVWDGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYPQYYLVDPIMFKMLALYMFFLICTGT 60
CichlidG1       MAWEGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYTQYYLADPIFFKLLAFYMFFLICTGT 60
ZebrafishG1     --------MNGTEGSNFYIPMSNRTGLVRSPYDYTQYYLAEPWKFKALAFYMFLLIIFGF 52
                *****.***********:****::*.****.:* ** **:***:** *

TetraodonG1     PINGLTLLVTAQNKKLRQPLNYILVNLAVAGLIMCAFGFTITITSAINGYFILGATACAV 120
CichlidG1       PINSLTLFVTAQNKKLRQPLNYILVNLAVAGLIMCCFGFTITITSAFNGYFILGSTFCAI 120
ZebrafishG1     PINVLTLVVTAQHKKLRQPLNYILVNLAFAGTIMVIFGFTVSFYCSLVGYMALGPLGCVM 112
                *** ***.****:***************.** ** ****::: .:: **: **. *.:

TetraodonG1     EGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFTGTHAAVGVLFTWIMAFACAGPPL 180
CichlidG1       EGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFSGAHAGAGVLFTWIMAMACAAPPL 180
ZebrafishG1     EGFFATLGGQVALWSLVVLAIERYIVVCKPMGSFKFSANHAMAGIAFTWFMACSCAVPPL 172
                ***:*****:**************************:. ** .*: ***:** :** ***

TetraodonG1     FGWSRYLPEGMQCSCGPDYYTLAPGYNNESYVIYMFVVHFFVPVFLIFFTYGSLVLTVR- 239
CichlidG1       FGWSRYIPEGMQCSCGPDYYTLAPGFNNESYVIYMFVVHFFVPVFIIFFTYGSLVMTVKA 240
ZebrafishG1     FGWSRYLPEGMQTSCGPDYYTLNPEYNNESYVMYMFSCHFCIPVTTIFFTYGSLVCTVKA 232
                ******:***** ********* * :******:*** ** :** ********* **:

TetraodonG1     AAAQQQESESTQKAQREVTRMCILMVLGFLVAWTPYATFSGWIFMNKGAAFHPLTAALCA 299
CichlidG1       AAAQQQDSASTQKAEKEVTRMCVLMVMGFLIAWTPYASFAGWIFMNKGASFSALTAAIPA 300
ZebrafishG1     AAAQQQESESTQKAEREVTRMVILMVLGFLFAWVPYASFAAWIFFNRGAAFSAQAMAVPA 292
                ******:* *****::***** :***:***.**.***:*:.***:*:**:* . : *: *

TetraodonG1     FFAKSSALYNPVIYVLMNKQFRNCMLSTFGMGG--AVDDETS-VSASKTEVSSVS-- 351
CichlidG1       FFAKSSALYNPVIYVLMNKQFRNCMLSTIGMGG--MVEDETS-VSTSKTEVSSVS-- 352
ZebrafishG1     FFSKTSAVFNPIIYVLLNKQFRSCMLNTLFCGKSPLGDDESSSVSTSKTEVSSVSPA 349
                **:*:**::**:****:*****.***.*: * :**:* **:*********
Alignment is key
Any other analysis that you do is only as good as your alignment
If your alignment is bad subsequent analyses will be bad
Junk in = Junk out
Alignments
Tell you about sequence conservation
How much is there?
Where is it?
Calculate sequence similarities

Zebrafish  M--------NGTEGSNFYIPMSNR
Trout      M------Q-NGTEGSNFYIPMSNR
Medaka     M------E-NGTEGKNFYIPMNNR
Cod        M----RMEANGTEGKNFYIPMSNR
Halibut    MVWDGGIEPNGTEGKNFYIPMSNR
Tetraodon  MVWDGGIEPNGTEGKNFYIPMSNR
Goldfish   M--------NGTEGNNFYVPLSNR
Killifish  M---GYG-PNGTEGNNFYIPMSNK
           *        *****.***:*:.*:
Pairwise comparisons
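One way to make these pairwise comparisons concrete is to count matching residues between every pair of aligned rows. The Python sketch below uses the eight aligned fragments shown above. It assumes that any column where either sequence has a gap is skipped; that is one common convention among several, and the lecture does not say which one it uses.

```python
from itertools import combinations

# Aligned N-terminal fragments from the slide ('-' marks a gap).
alignment = {
    "Zebrafish": "M--------NGTEGSNFYIPMSNR",
    "Trout":     "M------Q-NGTEGSNFYIPMSNR",
    "Medaka":    "M------E-NGTEGKNFYIPMNNR",
    "Cod":       "M----RMEANGTEGKNFYIPMSNR",
    "Halibut":   "MVWDGGIEPNGTEGKNFYIPMSNR",
    "Tetraodon": "MVWDGGIEPNGTEGKNFYIPMSNR",
    "Goldfish":  "M--------NGTEGNNFYVPLSNR",
    "Killifish": "M---GYG-PNGTEGNNFYIPMSNK",
}


def identity(a, b):
    """Fraction of identical residues over columns with no gap in either row."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)


for first, second in combinations(alignment, 2):
    score = identity(alignment[first], alignment[second])
    print(f"{first:10s} {second:10s} {score:.2f}")
```

Skipping gapped columns means very short overlaps can give high identity from few residues, so the number of compared columns is worth reporting alongside the score in real use.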
Use tree to show sequence relationships
Short branches mean sequences are more similar
Long branches mean there are more differences
Q3. How do we build phylogenies?
Assume the relationships involve bifurcating branches
Example sequences at the branch tips: ATC, ATG, ACG, CCG, CCC
Methods to determine similarities
Parsimony
Distance
Maximum likelihood
Bayesian
Parsimony
The least complex explanation is the most likely to be correct (Occam’s razor)
The preferred phylogenetic tree is the one that requires the fewest changes
Count up # changes for all possible trees
Find the shortest one
Trees based on parsimony
Tree 1 tips, in order: ATCG, ATCG, ACCG, ACCG (one C-T change)
Tree 2 tips, in order: ATCG, ACCG, ATCG, ACCG (two C-T changes)
Tree 1 is most parsimonious
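Counting changes on a given tree can be sketched with Fitch's algorithm, a standard small-parsimony method that the lecture does not name; it is used here as an assumption about how the counting might be done. Trees are written as nested tuples of tip names, and the tip labels A to D are hypothetical names for the four example sequences.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one site.

    tree is either a tip name (str) or a (left, right) tuple.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is required at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Total number of changes over all sites of an alignment."""
    length = len(next(iter(seqs.values())))
    total = 0
    for i in range(length):
        site = {name: seq[i] for name, seq in seqs.items()}
        total += fitch_site(tree, site)[1]
    return total


seqs = {"A": "ATCG", "B": "ATCG", "C": "ACCG", "D": "ACCG"}

# The three possible groupings of four tips.
trees = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

scores = {label: parsimony_score(tree, seqs) for label, tree in trees.items()}
best = min(scores, key=scores.get)
print(scores, "shortest:", best)
```

With only four tips all three groupings can be scored directly. The number of possible trees grows very quickly with the number of tips, so real programs search heuristically instead of enumerating every tree.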
Trees based on parsimony
Tree 1 tips, in order: T, T, C, C (one C-T change)
Tree 2 tips, in order: T, C, T, C (two C-T changes)
Tree 1 is most parsimonious
Can’t always distinguish tree topologies
Tree 1 tips, in order: T, T, C, C (one C-T change)
Tree 2 tips, in order: T, T, C, C (one C-T change)
Equally parsimonious
Other limitations
All changes are weighted the same
C-T same as C-A
Same no matter how long it takes for the change to occur
Distance methods
Calculate a numerical value for sequence differences
Do for all pairwise combinations
Build tree by joining most similar sequences and then more divergent
Fast
Pretty robust
Only deals with data in pairs
Pairwise distances
Taxa1 AACGGTCATGGCGTTGCATT
Taxa2 AACGGTCAGGGCGTTGCATT
Taxa3 AACGGTCACGCCGCTGCATT

     1    2    3
1    0    .05  .15
2    .05  0    .15
3    .15  .15  0
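The matrix above can be reproduced with a short Python sketch. It computes the fraction of sites that differ between each pair of equal-length sequences, which is the simplest distance measure; the function name `p_distance` is illustrative.

```python
taxa = {
    "Taxa1": "AACGGTCATGGCGTTGCATT",
    "Taxa2": "AACGGTCAGGGCGTTGCATT",
    "Taxa3": "AACGGTCACGCCGCTGCATT",
}


def p_distance(a, b):
    """Fraction of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


names = list(taxa)
matrix = [[p_distance(taxa[row], taxa[col]) for col in names] for row in names]

for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:.2f}" for value in row))
```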
Distance, d
p is the fractional similarity of two sequences
Simplest form of distance: d = 1 - p
AACGGTCATGGCGTTGCATT
AACGGTCACGGCGTTGCATT
p = 19/20, d = 0.05
Tree building
Neighbor joining
Join most similar pair of sequences
Add more divergent after

     1    2    3
1    0    .05  .15
2    .05  0    .15
3    .15  .15  0

Tree tips: 1, 2, 3
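The "join the most similar pair, then add the more divergent" idea can be sketched as a greedy clustering. The Python below is a deliberate simplification that uses average distances between clusters (UPGMA-style); it is not the full neighbor-joining algorithm, which also corrects for unequal rates across branches. The function name `build_tree` is hypothetical.

```python
from itertools import combinations


def build_tree(names, matrix):
    """Greedy average-linkage clustering; returns a nested tuple of names."""
    # Each cluster holds (subtree, indices of member sequences).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def cluster_distance(c1, c2):
        pairs = [(i, j) for i in c1[1] for j in c2[1]]
        return sum(matrix[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


names = ["1", "2", "3"]
matrix = [
    [0.00, 0.05, 0.15],
    [0.05, 0.00, 0.15],
    [0.15, 0.15, 0.00],
]

tree = build_tree(names, matrix)
print(tree)
```

Sequences 1 and 2 are joined first because their distance is smallest, and sequence 3 is attached afterwards, which matches the order described on the slide.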
How different can 2 sequences get?
At infinite time, random probability that two sequences are the same
Probability a base is same = 1/4
DNA only has 4 bases
Certain sites will start to change multiple times
Need to account for these multiple hits
Random sequences
Write down 20 bases of sequence
Compare your sequence to this one
AGTCCGATTACGGCTAGCAG
What fraction of sites are the same in the two sequences?
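The class exercise can be repeated many times in Python. This sketch assumes all four bases are equally likely at each position, compares many random 20-base sequences against the reference sequence above, and averages the fraction of matching sites. The seed and trial count are arbitrary choices.

```python
import random


def fraction_identical(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def random_dna(length, rng):
    """Random sequence with equal base frequencies (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


reference = "AGTCCGATTACGGCTAGCAG"
rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10000

mean_identity = sum(
    fraction_identical(reference, random_dna(len(reference), rng))
    for _ in range(trials)
) / trials

print(f"mean fraction identical over {trials} trials: {mean_identity:.3f}")
```

The average settles close to 0.25, while any single 20-base comparison can easily land anywhere between about 0.1 and 0.4.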
Sequence similarity decays to 25% over long times
Plot: sequence similarity (y-axis, 0 to 1.2) against time (x-axis, 0 to 3.5)
Sequence difference maxes at 0.75
Plot: sequence difference (y-axis, 0 to 1) against time (x-axis, 0 to 3.5)
Sequence change accumulates linearly with time at beginning
Plot: sequence difference (y-axis, 0 to 1) against time (x-axis, 0 to 3.5)
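The shape of these curves can be reproduced under the Jukes-Cantor model, which is assumed here because the lecture does not say which model produced the plots. In that model the expected fraction of differing sites is 0.75 x (1 - exp(-4d/3)), where d is the expected number of substitutions per site separating the two sequences (proportional to time under a constant rate).

```python
import math


def expected_difference(d):
    """Expected fraction of differing sites under Jukes-Cantor.

    d is the expected number of substitutions per site between
    the two sequences.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


for d in (0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 3.5):
    observed = expected_difference(d)
    print(f"d = {d:4.2f}  difference = {observed:.3f}  similarity = {1 - observed:.3f}")
```

For small d the observed difference is almost equal to d, which is the linear part of the curve. As d grows, the difference flattens out toward 0.75 and the similarity toward 0.25.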
DNA models
Use different DNA models to account for how sequences evolve with time
Allows you to apply different molecular clocks
Relate sequence change to time
Clock is not linear except for small changes and short times
Models are the same as those used in maximum likelihood methods
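As one concrete example of such a model, the Jukes-Cantor correction converts an observed fraction of differing sites into an estimated number of substitutions per site, accounting for multiple hits. The lecture does not name a specific model, so Jukes-Cantor (the simplest, with equal rates for all base changes) is an assumption chosen for illustration.

```python
import math


def jukes_cantor_distance(p_diff):
    """Estimated substitutions per site from the observed difference.

    p_diff is the observed fraction of differing sites; it must be
    below 0.75, the saturation level for four equally likely bases.
    """
    if not 0 <= p_diff < 0.75:
        raise ValueError("observed difference must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)


for p_diff in (0.05, 0.15, 0.40, 0.70):
    print(f"observed {p_diff:.2f} -> corrected {jukes_cantor_distance(p_diff):.3f}")
```

The correction is small for similar sequences and grows rapidly as the observed difference approaches 0.75, which is why the clock is only close to linear for small changes.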
How good is your tree?
Bootstrap approach
Run the same method multiple times
Subsample data each time
Use 50% of data
See how reproducible the trees are
Count how many times a particular grouping occurs
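The resampling procedure can be sketched in Python as follows. This follows the slide's description by drawing 50% of the alignment columns without replacement in each replicate; the standard bootstrap instead resamples all columns with replacement. The three sequences are the ones from the pairwise distance example, a grouping is counted only when that pair is strictly the closest, and the function names, seed, and replicate count are illustrative choices.

```python
import random
from itertools import combinations

taxa = {
    "Taxa1": "AACGGTCATGGCGTTGCATT",
    "Taxa2": "AACGGTCAGGGCGTTGCATT",
    "Taxa3": "AACGGTCACGCCGCTGCATT",
}


def difference_count(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def unique_closest_pair(seqs):
    """Return the single closest pair of names, or None if there is a tie."""
    counts = {
        pair: difference_count(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(sorted(seqs), 2)
    }
    smallest = min(counts.values())
    winners = [pair for pair, count in counts.items() if count == smallest]
    return winners[0] if len(winners) == 1 else None


def support(seqs, grouping, replicates=1000, fraction=0.5, seed=0):
    """Percent of subsampled replicates in which grouping is recovered."""
    rng = random.Random(seed)
    length = len(next(iter(seqs.values())))
    keep = max(1, int(length * fraction))
    hits = 0
    for _ in range(replicates):
        columns = rng.sample(range(length), keep)
        subsample = {n: "".join(s[i] for i in columns) for n, s in seqs.items()}
        if unique_closest_pair(subsample) == grouping:
            hits += 1
    return 100.0 * hits / replicates


value = support(taxa, ("Taxa1", "Taxa2"))
print(f"support for Taxa1 + Taxa2: {value:.0f}%")
```

With only three informative sites in this toy alignment, many subsamples miss the columns that separate Taxa3, so the grouping is not recovered every time. That is the kind of uncertainty a support value is meant to expose.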
Distance tree for rod and cone transducin alpha subunit
Branch lengths are proportional to sequence differences
Bootstrap values are given for each node, which tell how reproducible that grouping is
Bootstrap values shown at the nodes: 58, 100, 100, 95, 98, 72, 69, 72, 98, 86, 98, 68, 97