Michel Veuille
Ecole Pratique des Hautes Etudes
Director of the Systematics and Evolution dept, Muséum National d’Histoire Naturelle
Paris
Scientific Advisory Board of the CBOL
Data Analysis Working Group
What is the molecular signature of speciation events?
There is no molecular signature of speciation events
What are the other signatures of speciation events?
There is no universal signature of speciation events
But there are local signatures of speciation events, and one kind of signature (e.g. morphological) can be present when the other (e.g. genetic) is absent
In 1998, the common European earwig was shown to consist of two sympatric and reproductively isolated species differing only in the number of annual broods (one or two broods per year).
Wirth, Le Guellec, Vancassel, & Veuille. 1998. Evolution 52: 260-265.
Wirth, Le Guellec, & Veuille. 1999. MBE 16: 1645-1653.
A case of two mtDNA species with no morphological difference
The two species differ strikingly in COII sequence
But since they present no apparent morphological difference, the two species remain unnamed
Two examples : 1st / 2
European earwig Forficula auricularia
This is because the GC% of these species evolves at a very high rate
[Figure: GC% at COII in Hexapoda, earwigs compared with other Hexapoda]
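As a reminder of what the GC% statistic measures, here is a minimal sketch of its computation for one sequence. The function name `gc_percent` and the example strings are hypothetical and are not taken from the earwig data; gaps and ambiguous bases are simply ignored, which is an assumption of this sketch.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    n_acgt = sum(seq.count(base) for base in "ACGT")
    if n_acgt == 0:
        raise ValueError("sequence contains no unambiguous base")
    n_gc = seq.count("G") + seq.count("C")
    return 100.0 * n_gc / n_acgt


# Hypothetical toy sequences, for illustration only.
balanced = gc_percent("ATGC")
at_rich = gc_percent("AATTAATT")
```

Comparing this value across aligned COII sequences of different taxa is one way to see how fast base composition drifts in a lineage.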
Drosophila santomea lives in the highlands of São Tome, above 1100 m.
Drosophila yakuba lives in the lowlands, below 1100 m.
After Lachaise et al. Proc. Roy Soc. London, 2000
A case of two morphological species with no mtDNA difference
Two examples : 2nd / 2
Drosophila santomea Drosophila yakuba
São Tome
They hybridize at 1100 m, and nevertheless remain genetically distinct
They share the same mitochondria, but can be easily identified through the colour pattern of the abdomen
[Figure: phylogeny of nine Drosophila species, with year of description and range]
D. melanogaster: 1830, Tropical Africa + worldwide
D. simulans: 1919, Tropical Africa + worldwide
D. mauritiana: 1974, Mauritius island
D. sechellia: 1981, Seychelles islands
D. yakuba: 1954, Tropical Africa
D. santomea: 2000, São Tome island
D. teissieri: 1971, Tropical Africa
D. erecta: 1974, Tropical Africa
D. orena: 1978, Cameroon
D. santomea D. yakuba
Share the same mitochondrion through common descent
They belong to the Drosophila melanogaster ("black abdomen") subgroup
There are many definitions of species
The species concept is hotly debated
The position of the barcoder is challenging
« Species » make sense to everybody.
For example, 12% of the nouns in the French vocabulary* correspond to taxa that make sense to a taxonomist (species, families, varieties)
* From the Robert, a classic French dictionary
A solution is to let people use whatever species concept they prefer
and limit the barcoder’s activity to the domain where he/she can be helpful
[Diagram: ?0,000,000 species; black box; data & tools]
« This is species A or B »
« This is a new species »
Data analysis consists in providing data to taxonomists, in order to make decisions about the status of specimens and taxa.
(taxonomist) (barcoder)
Barcoding and taxonomic decisions are logically distinct, even though they can be performed by the same person.
What data analysis is about
[Figure: a query sequence placed in a local barcode tree within the tree of life; labels: sister group, closest validated node, closest COI validated node, closest validated node using additional information]
If we want to be sure of the assignment of a taxon, we must look at the nodes below the closest node that excludes a sister group with probability p < 0.01.
Below this point, a series of statistical and classificatory approaches allows us to estimate the probability that the query sequence does or does not belong to an already described species, based on the available information.
Alternatively, additional information using other genes, or an enlarged dataset can increase our understanding of the taxonomic status of the query.
What data analysis is about (contd)
The population genetics background behind data analysis
Principle: two sequences from the same population find their last common ancestor in each past generation with a constant probability p = 1/N.
It is a « death process », very different from a normal distribution.
The most probable coalescence time: t = 1
The expectation: t = N
P = 0.05 for t > 3N
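The three figures above follow from the geometric distribution of the coalescence time of two sequences. The sketch below checks them numerically; the population size `N = 1000` and the function names are assumptions chosen for illustration.

```python
def coalescence_pmf(t, N):
    """P(two lineages find their common ancestor exactly t generations back)."""
    p = 1.0 / N
    return p * (1.0 - p) ** (t - 1)


def tail_probability(t, N):
    """P(the two lineages have not yet coalesced after t generations)."""
    return (1.0 - 1.0 / N) ** t


N = 1000  # hypothetical number of gene copies in the population

# Most probable time: the probability decreases with t, so the mode is t = 1.
mode = max(range(1, 5 * N), key=lambda t: coalescence_pmf(t, N))

# Expectation: the sum is truncated at 50 N generations, where the
# remaining probability mass is negligible.
mean = sum(t * coalescence_pmf(t, N) for t in range(1, 50 * N))

# Probability of waiting longer than 3 N generations.
tail = tail_probability(3 * N, N)
```

With these settings `mode` is 1, `mean` is very close to N, and `tail` is close to 0.05 (about exp(-3)), which illustrates how skewed the distribution is compared with a normal one.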
[Figure: distribution of coalescence times; x-axis: past (generations)]
[Figure: probability p (0 to 1) against sample size n, with the genealogy of a sample n1 and its MRCA]
Probability p that the MRCA of a sample of size n is also the MRCA of the species, assuming a standard Wright-Fisher model.
p increases very rapidly: p = 0.6667 for n = 5, and p = 0.8 for n = 9. Increasing the sample size beyond this is of little use.
In a very large population p = (n-1)/(n+1)
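A short sketch of this formula shows how quickly the gain from extra sequences levels off. The function name and the list of sample sizes are chosen here for illustration.

```python
from fractions import Fraction


def prob_sample_mrca_is_species_mrca(n):
    """p = (n - 1) / (n + 1) for a sample of n sequences from a very large population."""
    if n < 2:
        raise ValueError("a sample needs at least two sequences")
    return Fraction(n - 1, n + 1)


# Illustrative sample sizes; each line prints n and the corresponding p.
for n in (2, 5, 9, 19, 99):
    print(n, float(prob_sample_mrca_is_species_mrca(n)))
```

Going from n = 9 to n = 19 only raises p from 0.8 to 0.9, and reaching 0.98 takes n = 99.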
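[Figure: genealogy of a sample n1 back to its MRCA; expected coalescence time of two sequences: N generations; expected time to the MRCA of n sequences: 2N (1-1/n) generations]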
Typically, under a standard equilibrium Wright-Fisher model (*), the expected time to the last common ancestor of the tree (MRCA) is only twice the time to the common ancestor of two randomly sampled sequences
(*) assuming:
- neutrality
- constant population size
- no structuring
- mutation-drift equilibrium
- N = effective number of genes
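The factor of two can be checked by adding up the expected waiting times of the standard coalescent: while k lineages remain, the next coalescence takes 2N / (k (k-1)) generations on average. The sketch below, with hypothetical function names and an arbitrary N, compares this sum with the closed form 2N (1-1/n).

```python
def expected_tmrca(n, N):
    """Sum of the expected waiting times while k = n, n-1, ..., 2 lineages remain."""
    return sum(2.0 * N / (k * (k - 1)) for k in range(2, n + 1))


def closed_form(n, N):
    """Closed form of the same sum: 2N (1 - 1/n) generations."""
    return 2.0 * N * (1.0 - 1.0 / n)


N = 1000  # hypothetical effective number of genes
pair = expected_tmrca(2, N)            # two sequences: N generations
large_sample = expected_tmrca(100, N)  # approaches 2N as n grows
```

For n = 2 the sum reduces to N generations, and for any larger sample it stays below 2N, so the deepest node is on average never more than twice as old as the ancestor of a random pair.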
[Figure: genealogies of a sample n1 and of a larger sample n2 > n1, each with its MRCA]
« The older nodes of a genealogy tend to be revealed in a small sample, whereas more recent portions are, on average, only revealed as the sample size per locus grows large. »
Kliman et al. 2000.
Using a larger dataset does not increase the information very much at this level
After AG Clark 1997
A long time after they have split, two species still share some neutral polymorphisms.
Polymorphisms can reach very far back into the past of a species, into the ancestral population shared with a sister species
Exploring shallow nodes
1. Matz and Nielsen’s MCMC method
Derived from Nielsen and Hey’s (2001) IM method, based on MCMC (Markov chain Monte Carlo).
The IM method estimates 5 parameters, which makes computation times very long
Matz and Nielsen (2005) reduce it to two parameters:
- the population size
- the time to speciation
They estimate the probability that the query sequence does or does not belong to the same species as the reference sample
The classification methods partition the dataset using a few characters
The distance methods work well with a small dataset, provided there are enough mutations
2. Evaluating classification and phylogenetic methods : Austerlitz et al.
They compare two classification methods:
- CART
- random forest
and two phylogenetic methods:
- neighbour-joining
- PhyML
They simulate n + 1 individuals in each species:
- n individuals are a reference sample
- the last individual is the query
Repeated simulations allow them to record the rate of correct assignment of the query to its species
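The simulation design above can be sketched in a few lines. This is not the procedure of Austerlitz et al.: the sequence model (binary sites, a fixed number of mutations separating the two species and a fixed number within each) and the nearest-neighbour assignment by Hamming distance are simplifying assumptions standing in for the coalescent simulations and the four methods they compare. All names and parameter values are hypothetical.

```python
import random


def mutate(seq, n_mut, rng):
    """Copy seq and flip n_mut randomly chosen sites (binary alphabet)."""
    out = list(seq)
    for _ in range(n_mut):
        site = rng.randrange(len(out))
        out[site] = 1 - out[site]
    return out


def hamming(a, b):
    """Number of sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_once(n_ref, n_div, n_within, length, rng):
    """Simulate n_ref + 1 individuals in each of two species and test one query."""
    ancestor = [0] * length
    samples = []
    for _ in range(2):
        founder = mutate(ancestor, n_div, rng)  # divergence since the split
        samples.append([mutate(founder, n_within, rng) for _ in range(n_ref + 1)])
    true_species = rng.randrange(2)
    query = samples[true_species].pop()   # the last individual is the query
    samples[1 - true_species].pop()       # keep both reference samples at size n_ref
    d_own = min(hamming(query, ref) for ref in samples[true_species])
    d_other = min(hamming(query, ref) for ref in samples[1 - true_species])
    return d_own < d_other                # ties are counted as failures


def success_rate(n_rep, n_ref, n_div, n_within, length, seed=0):
    """Fraction of repeated simulations in which the query is correctly assigned."""
    rng = random.Random(seed)
    hits = sum(assign_once(n_ref, n_div, n_within, length, rng) for _ in range(n_rep))
    return hits / n_rep


recent_split = success_rate(200, n_ref=10, n_div=0, n_within=2, length=500)
old_split = success_rate(200, n_ref=10, n_div=50, n_within=2, length=500)
```

Even this crude model reproduces the qualitative pattern: the success rate is low when the two species have just split and rises as divergence accumulates.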
Comparison of the methods for a low θ
(2 populations, reference sample size = 10)
[Figure: success rate (50% to 100%) against separation time (100 to 10000) for ML, CART and RF]
Classification methods perform better when variation is low
Comparison of the methods for a high θ
(2 populations, reference sample size = 10, θ = 30)
[Figure: success rate (50% to 100%) against separation time (100 to 10000) for ML, CART and RF]
Phylogenetic methods perform better for a highly variable population
Conclusion :
the appropriate method varies with the properties of the dataset
Comparing methods using realistic datasets
1. Litoria nannotis
2. Astraptes fulgerator
[Figure: success rate (80% to 100%) against sample size (0 to 30) for ML, CART and Random Forest]
[Figure: good assignment rate (90% to 100%) against reference sample size (3 to 10) for phylo and CART]
4 species; average sample size: 43.7; average θ = 1.54
12 species; average sample size: 38.8; average θ = 23.5
3. Cowries
[Figure: good assignment rate (80% to 100%) against sample size (0 to 30) for ML, CART and Random Forest]
Other solutions:
Can we replace COI?
Can we complement it with other genes?
Properties of bilaterian mtDNA | Other systems
Large number of copies per cell | rDNA has a high copy number
High mutation rate | Microsatellites also
Low variation / divergence ratio | Centromeres, telomeres (documented in Drosophila)
No recombination (asexual) | The Y is asexual; the other chromosomes recombine
Haploid | X chromosome, Y chromosome
Maternally inherited |
The main disadvantage of asexuality is that mitochondria do not follow Mendel's 2nd law: mtDNA carries no information on genetic barriers.
The main disadvantage of maternal inheritance is that mitochondria can be transferred horizontally along with Wolbachia endosymbiotic bacteria. Examples: Protocalliphora and Drosophila
Variation in mtDNA is lowered due to selective sweeps, according to Bazin et al (2006).
Variation is also lowered in some nuclear regions due to background selection.
Phylogeny of the fly Protocalliphora based on AFLP (nuclear markers), according to Whitworth et al (2007).
Symbols represent different Wolbachia strains
Maternally transmitted endosymbiotic bacteria : hitchhiking by Wolbachia
Phylogeny of Protocalliphora based on COI+COII. The authors claim that the assignment of unknown individuals to species is impossible in 60% of the species
After Whitworth et al. Proc Roy. Soc. B, in press
[Figure: phylogenetic tree of mtDNA, with its MRCA, beside a phylogram of nuclear DNA]
A phyletic tree in mtDNA represents true phyletic relationships. Mutations are in linkage disequilibrium because they do not recombine. Having two divergent clades is trivial under a standard Wright-Fisher model.
The phylogram of a recombining gene, in contrast, represents distances between haplotypes, where mutations can seem to « appear » repeatedly on several terminal branches.
Recombining genes thus inform us about the existence of barriers to gene flow.
Conclusions
1. There is no mitochondrial signature of speciation. There is no room for a barcode species concept, nor for anything like a « barcodon ».
2. Even a moderate sample can provide a wealth of information on the history of a species.
3. Additional information can be obtained in difficult cases, either by increasing the population sample, or by using additional markers.
The END