more sequence or more individuals, to combine or not?

Data: how much is needed? more sequence or more individuals, to combine or not?

Upload: ethan-dalton

Post on 13-Dec-2015


Page 1: More sequence or more individuals, to combine or not?

Data: how much is needed?

more sequence or more individuals, to combine or not?

Page 2: More sequence or more individuals, to combine or not?

14.4. Tue Introduction to models (Jarno)
16.4. Thu Distance-based methods (Jarno)
17.4. Fri ML analyses (Jarno)
20.4. Mon Assessing hypotheses (Jarno)
21.4. Tue Problems with molecular data (Jarno)
23.4. Thu Problems with molecular data (Jarno) Phylogenomics
24.4. Fri Search algorithms, visualization, and other computational aspects (Jarno)

Schedule

Page 3: More sequence or more individuals, to combine or not?

The trivial truth
◦ All extant species
◦ The whole genome

Impractical? Well, then
◦ As many species as possible
◦ As much data as possible

How much data?

Page 4: More sequence or more individuals, to combine or not?

Finite constraints on resources (time, money)
◦ Know your group – which taxa are the most relevant for your study?
◦ Know what gene sequences are available from previous studies

Choosing taxa or data

Page 5: More sequence or more individuals, to combine or not?

The days of single gene datasets are over

Mitochondrial and chloroplast DNA have been popular because they are easy to amplify and sequence

It is worth increasing the number of nuclear genes

One should aim for at least 3 genes, preferably more (maybe 10?)

Number of genes

Page 6: More sequence or more individuals, to combine or not?
Page 7: More sequence or more individuals, to combine or not?

It is now possible to increase the number of genes being sequenced significantly

Whole genome analyses will allow us to understand:
◦ Intron-exon boundary dynamics
◦ Gene duplication-deletion dynamics
◦ Gene transfer dynamics

Soon we will have a good understanding of the regions of the genome that are most suitable for systematics

Phylogenomics

Page 8: More sequence or more individuals, to combine or not?

Sometimes not all genes amplify from all samples
◦ Should these samples be discarded?

Increased taxon sampling, despite missing data, increases resolution

All possible data should be used!

Missing data?

Page 9: More sequence or more individuals, to combine or not?

Can separate independent data sets be combined for analysis?

How can we assess the possibility of conflict between different data?

What does the potential conflict then mean?

To combine or not to combine?

Page 10: More sequence or more individuals, to combine or not?

For instance
◦ Different genes may have different phylogenetic signal (different history?)

What is the problem?

Page 11: More sequence or more individuals, to combine or not?

If both genes have equally strong signal

Possible effects on results

Page 12: More sequence or more individuals, to combine or not?

If one gene has a stronger signal than the other

Possible effects on results

Page 13: More sequence or more individuals, to combine or not?

If one gene has a stronger signal than the other

Possible effects on results

Page 14: More sequence or more individuals, to combine or not?

Never combine

Combine sometimes

Always combine

Schools of thought

Page 15: More sequence or more individuals, to combine or not?

The different data sets may represent different evolutionary histories (e.g. different selection pressures)

Big data sets dominate small data sets

When analyzed separately, the different data sets can be tests of each other's phylogenetic hypotheses

Never combine!

Page 16: More sequence or more individuals, to combine or not?

Consensus trees of separate analyses

Data set A + Data set B = Their consensus
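
One common way to build such a consensus is the strict consensus, which keeps only the groupings found in every input tree. The following is a minimal Python sketch, assuming each tree has already been summarised as a set of clades. The function name, the two example trees and the taxon names are hypothetical and are not the trees shown on the slide.

```python
def strict_consensus(*trees):
    """Return the clades present in every tree.

    Each tree is given as a set of clades; each clade is a frozenset of
    taxon names. This representation is an assumption of the sketch.
    """
    shared = set(trees[0])
    for tree in trees[1:]:
        shared &= set(tree)
    return shared


# Hypothetical rooted trees on taxa A-D.
tree_a = {frozenset("AB"), frozenset("ABC")}  # (((A,B),C),D)
tree_b = {frozenset("AB"), frozenset("ABD")}  # (((A,B),D),C)

consensus = strict_consensus(tree_a, tree_b)
print(sorted("".join(sorted(clade)) for clade in consensus))  # ['AB']
```

Only the clade A+B survives, so the consensus is less resolved than either input tree. Conflicting clades are collapsed rather than resolved.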

Page 17: More sequence or more individuals, to combine or not?

[Figure: diagram with taxa labelled A to H]

My own experience:

Page 18: More sequence or more individuals, to combine or not?

Would be fantastic to get genealogical histories of individual genes

But!
◦ Single genes are generally short (1000-2000 bases)
◦ Lots of homoplasy
◦ Unreliable phylogenies

Problems with the approach

Page 19: More sequence or more individuals, to combine or not?

If the data sets are congruent, combine them

If the data sets are incongruent, don’t combine them

One can use the ILD test to decide whether data sets are incongruent

Well, sometimes you can combine...

Page 20: More sequence or more individuals, to combine or not?

If there is no conflict between data sets:
◦ The length of the most parsimonious tree from the combined data [L(x+y)] is equal to the sum of the lengths of the MP trees from the separately analyzed data [L(x) + L(y)]

Dxy = L(x+y) – (L(x) + L(y))

Dxy = 0

(Farris et al 1994)

ILD (Incongruence Length Difference)

Page 21: More sequence or more individuals, to combine or not?

Combining the data sets leads to increased homoplasy

But is it statistically significant?

Can be tested with the Mann-Whitney U test, where the null hypothesis is that the data sets are combinable

If Dxy > 0

Page 22: More sequence or more individuals, to combine or not?

Original: Data set x, Data set y

Combine data: Data sets x + y

Sample randomly to get equally large data sets: Data set p, Data set q

Page 23: More sequence or more individuals, to combine or not?

Search for MP trees and calculate Dpq values

Repeat many times (e.g. 1000), which gives us a distribution for the value of D

Compare whether Dxy differs from random distribution at P < 0.05

However:
◦ ILD-test is sensitive to relative sizes of compared data sets and to the evolutionary history of the different data sets

For the randomly generated data sets
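
The resampling procedure can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the implementation used in any phylogenetics package. It handles only four taxa, so that the MP tree length can be found by checking all three unrooted topologies, and it compares Dxy with the randomised Dpq values by simple counting. All function names and the example data are hypothetical.

```python
import random

# The three unrooted topologies for four taxa, as ((a, b), (c, d)) index pairs.
QUARTET_TOPOLOGIES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def site_length(site, topology):
    """Fitch parsimony length of one alignment column on one quartet topology."""
    (a, b), (c, d) = topology
    changes = 0
    left = {site[a]} & {site[b]}
    if not left:
        left = {site[a], site[b]}
        changes += 1
    right = {site[c]} & {site[d]}
    if not right:
        right = {site[c], site[d]}
        changes += 1
    if not (left & right):
        changes += 1
    return changes


def mp_length(sites):
    """Length of the most parsimonious tree: the minimum over all topologies."""
    return min(sum(site_length(s, t) for s in sites) for t in QUARTET_TOPOLOGIES)


def ild(x, y):
    """Dxy = L(x+y) - (L(x) + L(y)) for two lists of alignment columns."""
    return mp_length(x + y) - (mp_length(x) + mp_length(y))


def ild_test(x, y, replicates=999, seed=1):
    """Return (Dxy, P), where P is the share of random partitions with Dpq >= Dxy."""
    rng = random.Random(seed)
    observed = ild(x, y)
    pooled = x + y
    as_extreme = 0
    for _ in range(replicates):
        shuffled = rng.sample(pooled, len(pooled))
        p, q = shuffled[:len(x)], shuffled[len(x):]  # same sizes as x and y
        if ild(p, q) >= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (replicates + 1)


# Hypothetical data: x supports (taxon 1, taxon 2), y supports (taxon 1, taxon 3).
x = [("A", "A", "C", "C")] * 10
y = [("A", "C", "A", "C")] * 10
print(ild(x, y))  # 10
print(ild(x, x))  # 0
```

Calling `ild_test(x, y)` on the conflicting pair gives a small P value, while two copies of the same data set give Dxy = 0. With real data the tree search is heuristic and far more expensive, which is why the number of replicates matters in practice.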

Page 24: More sequence or more individuals, to combine or not?

But what if the conflict is only partial?

Page 25: More sequence or more individuals, to combine or not?

Combining all available data leads to more resolved trees = the combined data has higher explanatory power

“Hidden support” can only be detected through combined analysis

Conflicts at different nodes can only be discovered in a combined analysis framework

The effects of combined analysis can be investigated using indices related to Bremer support

Always combine!

Page 26: More sequence or more individuals, to combine or not?

Partitioned Bremer Support (PBS)
◦ Baker & DeSalle 1997: Syst Biol 46:654

Partition Congruence Index (PCI)
◦ Brower 2006: Cladistics 22:378

Hidden Bremer Support (HBS)
◦ Gatesy et al 1999: Cladistics 15:271

Indices related to Bremer Support

Page 27: More sequence or more individuals, to combine or not?

The different data partitions in a data set contribute to the Bremer support in an additive way

For each node:
◦ A negative Partitioned Bremer support value indicates conflict
◦ A positive Partitioned Bremer support value indicates congruence

PBS (Partitioned Bremer Support)
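
The additivity can be illustrated with a short Python sketch. It assumes that, for one node, we already know how many steps each partition requires on the MP tree from the combined analysis and on the shortest tree that lacks the node. The function name, partition names and tree lengths are hypothetical.

```python
def partitioned_bremer_support(length_without_node, length_on_mp_tree):
    """PBS per partition for one node.

    length_on_mp_tree: steps each partition needs on the combined-analysis MP tree.
    length_without_node: steps each partition needs on the shortest tree lacking the node.
    """
    return {name: length_without_node[name] - length_on_mp_tree[name]
            for name in length_on_mp_tree}


# Hypothetical tree lengths for two partitions at one node.
on_mp_tree = {"gene 1": 120, "gene 2": 80}
without_node = {"gene 1": 114, "gene 2": 93}

pbs = partitioned_bremer_support(without_node, on_mp_tree)
print(pbs)                # {'gene 1': -6, 'gene 2': 13}
print(sum(pbs.values()))  # 7, the Bremer support of the node
```

Here gene 1 conflicts with the node (negative value) and gene 2 supports it strongly enough that the node still has a Bremer support of 7.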

Page 28: More sequence or more individuals, to combine or not?

PBS in practice

Page 29: More sequence or more individuals, to combine or not?

PBS in practice

[Figure: example tree with Bremer support values 7 and 7, and partitioned Bremer support values 3,4 and -6,13]

Page 30: More sequence or more individuals, to combine or not?

Morphology, COI, EF1a, Wgl

Bremer Support

Page 31: More sequence or more individuals, to combine or not?

Tells us about the magnitude of conflict between data partitions in a combined analysis

PCI is always equal to or less than BS for a given branch

PCI = BS when there is no conflict

PCI is negative when there is low BS because of strong conflicts between data partitions

Partition Congruence Index

Brower 2006: Cladistics 22:378-386

Page 32: More sequence or more individuals, to combine or not?

Underlying phylogenetic signal can be confounded by homoplasy in separate analyses

Combining datasets can bring out this signal, as homoplasy is largely random noise

Can be measured using HBS and Partitioned HBS

Hidden support

Page 33: More sequence or more individuals, to combine or not?

Hidden support can be defined as increased support for the node of interest in the simultaneous analysis of all data partitions relative to the sum of support for that node in the separate analyses of each partition

Hidden support

Page 34: More sequence or more individuals, to combine or not?

For a particular combined data set and a particular node, HBS is the difference between BS for that node in the combined analysis and the sum of BS values for that node from each data partition

Measuring hidden support
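
As a worked example of this definition, here is a minimal Python sketch. The function name and the Bremer support values are hypothetical.

```python
def hidden_bremer_support(bs_combined, bs_separate):
    """HBS for one node: combined-analysis BS minus the summed BS of the separate analyses."""
    return bs_combined - sum(bs_separate)


# Hypothetical values for one node and two data partitions.
print(hidden_bremer_support(10, [3, 2]))  # 5
print(hidden_bremer_support(4, [3, 2]))   # -1
```

A positive value means the combined analysis supports the node more strongly than the separate analyses add up to. A negative value means the combined support is lower than that sum.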

Page 35: More sequence or more individuals, to combine or not?

With a small dataset, it is probably always best to combine everything

With large datasets (10 or 20 gene regions?) one can find sets of congruent genes and combine them

But!
◦ Is there a biological reason for incongruence, or is it just a property of the data?

So, what to do?

Page 36: More sequence or more individuals, to combine or not?

Problems inherent in molecular data

Niklas Wahlberg

Page 37: More sequence or more individuals, to combine or not?

Saturation

Bias in nucleotide composition

Orthology vs paralogy

Lineage sorting

Lateral Gene Transfer

What are the problems?

Page 38: More sequence or more individuals, to combine or not?

Saturation

Page 39: More sequence or more individuals, to combine or not?

Saturation is due to multiple changes at the same site subsequent to lineage splitting

Models of evolution attempt to infer the missing information through correcting for “multiple hits”

Most data will contain some fast evolving sites which are potentially saturated (e.g. in protein-coding genes, often the third codon position)

In severe cases the data becomes essentially random and all information about relationships can be lost

Saturation in sequence data
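
The simplest correction for multiple hits is the Jukes-Cantor distance, d = -3/4 ln(1 - 4p/3), where p is the observed proportion of differing sites. The Python sketch below is offered as an illustration of how the correction grows as sequences approach saturation. It is one model among many, and the function name is hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Expected substitutions per site given the observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); at 0.75 the sequences are saturated")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.3, 0.6, 0.7):
    print(p, round(jukes_cantor_distance(p), 3))
```

For small p the corrected distance is close to p. As p approaches 0.75, the value expected for unrelated random sequences under this model, the corrected distance grows without bound and the estimate becomes unreliable.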

Page 40: More sequence or more individuals, to combine or not?

[Figure: at a single site, one sequence shows C → A (1 change) and the other C → G → T → A (3 changes); labels: Seq 1, Seq 2, Number of changes]

Multiple changes at a single site - hidden changes

Ancest GGCGCG
Seq 1  AGCGAG
Seq 2  GCGGAC

Page 41: More sequence or more individuals, to combine or not?

Saturation

[Figure: pairwise distance calculated from sequences (y-axis) plotted against time since divergence (x-axis)]

Page 42: More sequence or more individuals, to combine or not?

Homoplasy is a problem with molecular data

Elevated rates of molecular evolution in unrelated lineages

Sparse taxon sampling leading to long branches

Saturation and long branch attraction

Page 43: More sequence or more individuals, to combine or not?

The classical long-branch attraction example

Based on one gene (18S)

Page 44: More sequence or more individuals, to combine or not?

Nardi et al. 2003: Science 299: 1887-1889

Page 45: More sequence or more individuals, to combine or not?

Taxon sampling is important

For divergent taxa with few extant species, can be a problem

More data from different sources
◦ Could be that molecular data are not able to resolve the position of some taxa
◦ Morphological data!

Is saturation a problem?

Page 46: More sequence or more individuals, to combine or not?

Biased base composition

Page 47: More sequence or more individuals, to combine or not?

Do sequences manifest biased base compositions (e.g. thermophilic convergence) or biased codon usage patterns which may obscure phylogenetic signal?

Biased base compositions?

Page 48: More sequence or more individuals, to combine or not?

% Guanine + Cytosine in 16S rRNA genes

                          %GC
                          all sites   variable sites   parsimony sites
Thermophiles:
Thermotoga maritima       62          72               73
Thermus thermophilus      64          72               70
Aquifex pyrophilus        65          73               71
Mesophiles:
Deinococcus radiodurans   55          52               48
Bacillus subtilis         55          50               38
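
Percentages of this kind can be computed with a few lines of Python. The sketch below assumes an alignment given as equal-length strings. The function names and the toy alignment are hypothetical and unrelated to the 16S rRNA data above.

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases of a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGTU"]
    if not bases:
        raise ValueError("no unambiguous bases to count")
    return 100 * sum(b in "GC" for b in bases) / len(bases)


def variable_columns(alignment):
    """Indices of alignment columns in which at least two different characters occur."""
    return [i for i, column in enumerate(zip(*alignment)) if len(set(column)) > 1]


def gc_percent_at(sequence, columns):
    """GC percentage of a sequence restricted to the given alignment columns."""
    return gc_percent("".join(sequence[i] for i in columns))


# Hypothetical toy alignment of three sequences.
alignment = ["GCATGCAT", "GCATGGAC", "GCATCGAC"]
columns = variable_columns(alignment)
for seq in alignment:
    print(seq, gc_percent(seq), gc_percent_at(seq, columns))
```

Comparing the all-sites value with the variable-sites value for each taxon shows whether the compositional bias is concentrated in the sites that actually carry the phylogenetic signal.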

Page 49: More sequence or more individuals, to combine or not?

A case study in phylogenetic analysis: Deinococcus and Thermus

Deinococcus are radiation resistant bacteria

Thermus are thermophilic bacteria

BUT:
◦ Both have the same very unusual cell wall based upon ornithine
◦ Both have the same menaquinones (Mk 9)
◦ Both have the same unusual polar lipids

Congruence between these complex characters supports a phylogenetic relationship between Deinococcus and Thermus

Page 50: More sequence or more individuals, to combine or not?

An appropriate method can correct for GC bias

[Figure: three trees of the same five taxa. Parsimony tree and Jukes & Cantor tree, tips listed as Aquifex, Thermotoga, Deinococcus, Bacillus, Thermus; LogDet tree, tips listed as Aquifex, Thermotoga, Deinococcus, Thermus, Bacillus]

Page 51: More sequence or more individuals, to combine or not?

Orthology and paralogy

Page 52: More sequence or more individuals, to combine or not?

Are the sequences being generated from different species the same (homologous)?

Gene duplication
◦ duplicate gene degenerates
◦ duplicate gene acquires new function

A problem that is particularly acute currently as we search for new genes

Orthology or paralogy?

Page 53: More sequence or more individuals, to combine or not?

ORTHOLOGY

Orthology: gene trees and species trees

[Figure: gene phylogeny of a, b, c shown alongside the organism phylogeny of A, B, C]

Page 54: More sequence or more individuals, to combine or not?

Darwin’s theory reinterpreted homology as common ancestry.

ATCGGCCACTTTCGCGATCA

ATAGGCCACTTTCGCGATCA ATCGGCCACTTTCGCGATCG

ATAGGCCACTTTCGCGATTA ATCGGCCACTTTCGTGATCG

ATAGGGCAGTTTCGCGATTA ATCGGCCACGTTCGTGATCG

ATAGGGCAGTTTTGCGATTA ATCGGCCACGTTCGCGATCG

ATAGGGCAGTTTCGCGATTA ATCGGCCACCTTCGCGATCG

ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG

ACCGGCCACCTTCGCGATCG ATAGGGCAGTCTCGCGATTA

Ancestral sequence

Homologous sequences
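
The sequences on this slide also illustrate hidden changes. The Python sketch below assumes that the first sequence in each row forms one lineage descending from the ancestral sequence. That reading of the extracted layout is an assumption, and the function name is hypothetical. The code counts the substitutions along the lineage and compares them with the differences still visible between the two endpoints.

```python
def differences(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Ancestral sequence followed by successive states of one descendant lineage,
# as read from the slide (the ordering is assumed).
lineage = [
    "ATCGGCCACTTTCGCGATCA",
    "ATAGGCCACTTTCGCGATCA",
    "ATAGGCCACTTTCGCGATTA",
    "ATAGGGCAGTTTCGCGATTA",
    "ATAGGGCAGTTTTGCGATTA",
    "ATAGGGCAGTTTCGCGATTA",
    "ATAGGGCAGTCTCGCGATTA",
]

events = sum(differences(a, b) for a, b in zip(lineage, lineage[1:]))
observed = differences(lineage[0], lineage[-1])
print(events, observed)  # 7 5
```

Seven substitutions occurred along the lineage, but only five differences remain between ancestor and descendant, because one site changed and then changed back.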

Page 55: More sequence or more individuals, to combine or not?

Orthologs arise by speciation

ATCGGCCACTTTCGCGATCA

ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG

Sequence in ancestral organism

Orthologous sequences

Speciation event

Modern species A Modern species B

Orthologs are “evolutionary counterparts” – Koonin (2001)

Page 56: More sequence or more individuals, to combine or not?

Paralogs arise by duplications

ATCGGCCACTTTCGCGATCA

ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG

Sequence in ancestral organism

Paralogous sequences

Duplication event

Modern duplicate A Modern duplicate B

Page 57: More sequence or more individuals, to combine or not?

An evolutionary tale…

Duplication of A in worm

Duplication of A in human

Sonnhammer & Koonin (2002) Trends Genet 18:619-620

Page 58: More sequence or more individuals, to combine or not?

The yeast gene is orthologous to all worm and human genes, which are all co-orthologous to the yeast gene

Evolutionary Relationships

Sonnhammer & Koonin (2002) Trends Genet 18:619-620

Page 59: More sequence or more individuals, to combine or not?

all genes in the HA* set are co-orthologous to all genes in the WA* set

Evolutionary Relationships

Sonnhammer & Koonin (2002) Trends Genet 18:619-620

Page 60: More sequence or more individuals, to combine or not?

The genes HA* are hence ‘inparalogs’ to each other when comparing human to worm.

Evolutionary Relationships

Sonnhammer & Koonin (2002) Trends Genet 18:619-620

Page 61: More sequence or more individuals, to combine or not?

duplication speciation

By contrast, the genes HB and HA* are ‘outparalogs’ when comparing human with worm

Evolutionary Relationships

Sonnhammer & Koonin (2002) Trends Genet 18:619-620

Page 62: More sequence or more individuals, to combine or not?

HB and HA*, and WB and WA* are inparalogs when comparing with yeast, because the animal–yeast split pre-dates the HA*–HB duplication

duplication

speciation

Evolutionary Relationships

Sonnhammer & Koonin (2002) Trends Genet 18:619-620

Page 63: More sequence or more individuals, to combine or not?

PARALOGY

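[Figure: after a gene duplication, the organism phylogeny (A, B, C) contains two gene phylogenies, a1, b1, c1 and a2, b2, c2; the copies a1*, b2* and c1* are marked with asterisks. A tree built from a1, b2 and c1 is labelled "Misleading tree"]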

Paralogy can produce misleading trees

Page 64: More sequence or more individuals, to combine or not?

Ancient gene duplications can be used to root the tree of life

Ancestral Elongation Factor Gene

Gene Duplication Prior To Split Into 3 Domains Of Life

EF-Tu/ 1-alpha + EF-2/ EF-G = paralogues of each other

Sequences from one paralogue can be used to root a tree formed using sequences from the other and vice versa

Page 65: More sequence or more individuals, to combine or not?

Lineage sorting

Page 66: More sequence or more individuals, to combine or not?

Gene trees may not be the same as species trees

Extant populations may retain ancestral polymorphisms

Species level phylogenies should never sample single individuals of different species

Lineage sorting

Page 67: More sequence or more individuals, to combine or not?

Implicit assumption in many studies using mtDNA

The mode of speciation can now be studied using DNA sequences

Theoretical studies predict that DNA lineages pass through several phases in a species

Are species monophyletic?

Page 68: More sequence or more individuals, to combine or not?

Time

A B

Ancestral gene pool

The assumption: monophyly

Page 69: More sequence or more individuals, to combine or not?

Time

A B

The assumption: monophyly

Page 70: More sequence or more individuals, to combine or not?

Paraphyly can occur when one population in a set of locally panmictic populations speciates

Polyphyly occurs when a highly polymorphic population is subdivided

Can be highly informative of the history of divergence

The presence of poly- and paraphyletic lineages

Page 71: More sequence or more individuals, to combine or not?

Time

A B

Ancestral gene pool

Paraphyly

Page 72: More sequence or more individuals, to combine or not?

Time

A B

Paraphyly

Page 73: More sequence or more individuals, to combine or not?

Time

A B

Polyphyly

Page 74: More sequence or more individuals, to combine or not?

Time

A B

Polyphyly

Page 75: More sequence or more individuals, to combine or not?

Polyphyly

Page 76: More sequence or more individuals, to combine or not?

An empirical example: Phyciodes butterflies

[Figure: phylogenetic tree of individual Phyciodes specimens (tharos, cocyta, batesii, pulchella, phaon, mylitta, orseis, pallida, picta, vesta, pallescens), each tip labelled with subspecies, specimen code and locality, with support values on the branches]

Wahlberg et al. 2003. Syst Ent 28:257-273

Page 77: More sequence or more individuals, to combine or not?

Paraphyly of a species can be due to incomplete lineage sorting and/or secondary gene flow

Page 78: More sequence or more individuals, to combine or not?

G = generations, starting with ten unrelated females at G = 0
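
The loss of maternal lineages over generations can be explored with a small simulation. The Python sketch below assumes a constant population of ten females in which every daughter picks her mother at random from the previous generation. This is a simplification and not necessarily the model behind the figure. The function name and parameters are hypothetical.

```python
import random


def surviving_lineages(founders=10, generations=50, seed=0):
    """Number of founding maternal lineages still present in each generation."""
    rng = random.Random(seed)
    population = list(range(founders))  # each female is labelled by her founding lineage
    counts = [len(set(population))]
    for _ in range(generations):
        # Each female of the next generation inherits the lineage of a random mother.
        population = [rng.choice(population) for _ in range(founders)]
        counts.append(len(set(population)))
    return counts


counts = surviving_lineages()
print(counts[0], counts[-1])
```

The count can only stay the same or fall, since a lost lineage cannot reappear. Until a single lineage remains, a population can carry several ancestral lineages, which is the situation in which incomplete lineage sorting can make a species appear paraphyletic.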

Page 79: More sequence or more individuals, to combine or not?

Lateral gene transfer

Page 80: More sequence or more individuals, to combine or not?

Widespread in single-celled organisms
◦ Even between distantly related lineages

In multicellular organisms more a problem in closely related species
◦ hybridization

Lateral Gene Transfer

Page 81: More sequence or more individuals, to combine or not?

Is the Tree of Life really a Web of Life?

Lateral Gene Transfer

Page 82: More sequence or more individuals, to combine or not?

These “problems” are highly interesting phenomena in themselves!

When the different factors are taken into account, they can be informative about evolutionary history

“When in doubt, get more data” - Brooks and McLennan 2002

Problems inherent in molecular data?