Models of gene duplication, transfer and loss to study genome evolution

Bastien Boussau
LBBE, CNRS, Université de Lyon

Post on 29-Jul-2015


Collaborators

Lyon collaborators:

• Adrián Arellano Davín

• Gergely Szöllősi (Budapest)

• Vincent Daubin

• Eric Tannier

• Thomas Bigot

• Magali Semeria

• Manolo Gouy

• Laurent Duret

• Nicolas Lartillot

Austin/Illinois collaborators:

• Siavash Mirarab

• Md. Shamsuzzoha Bayzid

• Tandy Warnow

RevBayes collaborators:

• Sebastian Hoehna • Michael Landis • Tracy Heath • Fredrik Ronquist • Brian Moore • John Huelsenbeck • …

Plan

1. Gene duplications and losses

• Mammalian genomes

2. Gene duplications, losses and transfers

• Fungi and Cyanobacteria

3. A fast approach to dealing with incomplete lineage sorting

• Birds

4. 2 vignettes

To study genome evolution:

1. One species tree

2. Thousands of gene trees

[Figure: species tree for species A, B, C, D along a time axis]

Why our current pipeline can be improved

[Figure: pipeline diagram; labels not recoverable]

• Gene alignments:
  • Error prone (genes are short)
  • Point estimates

• Gene trees:
  • Based on alignments
  • Point estimates

• Species trees:
  • Based on gene trees

[Figure: gene trees drawn inside a species tree for species A, B, C, D along a time axis; panels labeled D (duplication), DL (duplication and loss), LGT (lateral gene transfer), and ILS (incomplete lineage sorting)]

DL: Boussau et al., Genome Research 2013

DL+T: Szöllősi et al., PNAS 2013

ILS: Mirarab et al., Science 2014

PHYLDOG

Joint reconstruction of the species tree, gene trees, and numbers of duplications and losses

Input: all gene families (thousands of alignments)

Output: rooted species tree, numbers of duplications and losses, rooted gene trees

Probabilistic models:
• sequence evolution
• gene family evolution

[Figure: species tree with duplication counts D1 to D6 and loss counts L1 to L6 on its branches]

Boussau et al., Genome Research 2013

PHYLDOG: a model of gene duplication and loss

Assumptions
• Genes evolve along the species tree:
  • birth events:
    • duplications (rate of duplication)
  • death events:
    • losses (rate of loss)
• Each gene family is independent of other genes
• Each gene copy is independent of other copies
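The assumptions above describe a birth-death process on gene copies. The following is a minimal sketch of such a process along a single branch of the species tree, written as a plain Gillespie simulation. It is not PHYLDOG's implementation; the function name, the parameter names, and the rates in the usage lines are hypothetical and chosen only for illustration.

```python
import random


def simulate_copies(n0, dup_rate, loss_rate, branch_length, rng):
    """Return the number of gene copies at the end of one branch.

    Each copy duplicates at rate dup_rate and is lost at rate loss_rate,
    independently of the other copies (illustrative assumption).
    """
    n, t = n0, 0.0
    per_copy_rate = dup_rate + loss_rate
    while n > 0 and per_copy_rate > 0:
        # Waiting time to the next event among the n current copies.
        t += rng.expovariate(n * per_copy_rate)
        if t > branch_length:
            break
        # The event is a duplication with probability dup / (dup + loss).
        if rng.random() < dup_rate / per_copy_rate:
            n += 1
        else:
            n -= 1
    return n


# Usage: copy numbers of a few independent families along the same branch.
rng = random.Random(42)
counts = [simulate_copies(1, 0.2, 0.3, 1.0, rng) for _ in range(10)]
print(counts)
```

Because families and copies are independent under these assumptions, each family can be simulated (and, in a likelihood setting, scored) separately.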

Study of mammalian genome evolution

• Challenging but well-studied phylogeny

• 36 mammalian genomes available in Ensembl v. 57

• About 7000 gene families

• Correction for poorly sequenced genomes

PHYLDOG finds a good species tree

[Figure: species tree of the 36 mammals; labeled clades: Laurasiatheria, Afrotheria, Xenarthra, Marsupials, Primates, Glires]

Quality of the gene trees

Comparison between:
• PhyML (used for the PhylomeDB and Homolens databases)
• TreeBeST (used for the Ensembl-Compara database)
• PHYLDOG

Two approaches:
• Looking at ancestral genome sizes
• Assessing how well one can recover ancestral syntenies using reconstructed gene trees (Bérard et al., Bioinformatics 2012)

PHYLDOG: better trees for better ancestral genomes

[Figure: mammalian species tree with bar charts comparing PHYLDOG, TreeBeST, and PhyML; axis values not recoverable]

An example gene family

[Figure: two reconstructions of the same gene family, one by TreeBeST and one by PHYLDOG; tips are labeled with mammalian species names]

Boussau et al., Genome Research 2013

Recent improvements to PHYLDOG

• Easier installation using CMake or a virtual machine
• Better algorithms for gene tree inference
• Better algorithm for the starting species tree
• Faster computations using the Phylogenetic Likelihood Library (PLL, A. Stamatakis group)
• Python scripts to help run the program

Plan

1. Gene duplications and losses

• Mammalian genomes

2. Gene duplications, losses and transfers

• Fungi and Cyanobacteria

3. A fast approach to dealing with incomplete lineage sorting

• Birds

4. 2 vignettes


Gene transfers and the quixotic pursuit of the TOL

Doolittle WF, Science 1999

“The monistic concept of a single universal tree appears […] increasingly obsolete. […] [It is] no longer the most scientifically productive position to hold […] [It] accounts for only a minority of observations from genomes.”

Bapteste, O’Malley, Beiko, Ereshefsky, Gogarten, Franklin-Hall, Lapointe, Dupré, Dagan, Boucher, Martin, Biology Direct 2009.

exODT: a model of gene duplication, transfer, and loss

Assumptions
• Genes evolve along the species tree:
  • birth events:
    • duplications (rate of duplication)
    • transfers (rate of receiving a gene)
  • death events:
    • losses (rate of loss)
• Each gene family is independent of other genes
• Each gene copy is independent of other copies
• Transfers can go through unsampled/extinct species

Szöllősi et al., Syst. Biol. 2013a

Better gene trees, fewer transfers

[Figure: two panels comparing the usual approach with ALE+DTL; y-axes: RF distance to real tree, and transfer events per family]

Szöllősi et al., Syst. Biol. 2013b
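The RF (Robinson-Foulds) distance used to score gene trees against the real tree counts the splits present in one tree but not in the other. The sketch below computes it for small trees written as nested tuples of leaf names. The tree encoding and the function names are assumptions made for this example, and this is not the code used in the cited study.

```python
def leaf_set(tree):
    """Return the set of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial splits (bipartitions) of the unrooted tree.

    Each split is stored as the side that does not contain a fixed
    reference leaf, so that the same split always has the same key.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaves")
    return len(splits(tree_a) ^ splits(tree_b))


true_tree = (("A", "B"), ("C", "D"))
inferred_tree = (("A", "C"), ("B", "D"))
print(rf_distance(true_tree, inferred_tree))
```

Because splits ignore the position of the root, two different rootings of the same unrooted tree are at distance 0 under this definition.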

Application to real data: Cyanobacteria and Fungi

Cyanobacteria
• > 2.4 billion years old
• 40 species
• 1,200 to 4,500 protein-coding genes
• 7,410 gene families
Transfers are expected

Fungi (Dikarya)
• ~ 1 billion years old
• 28 species
• 5,200 to 10,000 protein-coding genes
• 11,387 gene families
Transfers should be less frequent

Both cases:
• fixed species tree, gene trees inferred using the Duplication, Transfer and Loss model

Szöllősi et al., under review

Cyanobacteria

[Figure: not recoverable]

0.18 transfers per gene

Fungi

[Figure: not recoverable]

0.07 transfers per gene

Szöllősi et al., under review

Comparing transfer rates

• Cyanobacteria and Fungi differ in their age:
  We can compare normalized numbers of events: T/(T+D)

• The Cyanobacteria and Fungi data sets differ in their number of species:
  We can perform rarefaction studies

Szöllősi et al., under review
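A minimal sketch of the two corrections named on this slide, assuming T and D stand for the inferred numbers of transfers and duplications. The function names and the event counts are hypothetical. The rarefaction helper only draws the species subsamples: a real rarefaction study would rerun the reconstruction on each subsample, which is not shown here.

```python
import random


def normalized_transfer_fraction(transfers, duplications):
    """Return T / (T + D), the share of transfers among gene-birth events."""
    total = transfers + duplications
    if total == 0:
        raise ValueError("no transfer or duplication events to normalize")
    return transfers / total


def rarefaction_subsamples(species, sample_size, replicates, rng):
    """Draw random species subsets of equal size for a rarefaction study."""
    return [sorted(rng.sample(species, sample_size)) for _ in range(replicates)]


# Hypothetical event counts, for illustration only.
print(normalized_transfer_fraction(transfers=30, duplications=70))

# Subsample a 40-species data set down to 28 species, five times.
rng = random.Random(1)
species = [f"sp{i}" for i in range(40)]
subsets = rarefaction_subsamples(species, 28, 5, rng)
print(len(subsets), len(subsets[0]))
```

Dividing by T+D removes the dependence on the total amount of evolution, and subsampling to a common number of species makes the two data sets comparable in size.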

Similar transfer rates in Fungi and Cyanobacteria

Szöllősi et al., under review

Using transfers to date clades

[Figure: species tree with a time axis and a question mark]

Because we can identify gene transfers, we have information for ordering the nodes of a species tree
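One way to see why transfers carry ordering information: a gene can only leave a branch after that branch has started, so the node at the top of the donor branch must be older than the node at the bottom of the recipient branch. The sketch below turns such constraints into a node order with a topological sort. The data layout, the names, and the choice to use only this one constraint per transfer are assumptions for illustration, not the method of the cited work.

```python
from graphlib import CycleError, TopologicalSorter


def order_nodes(parent, transfers):
    """Order species-tree nodes from oldest to youngest.

    parent maps each non-root node to its parent node. Each transfer is a
    (donor, recipient) pair of branches, with a branch named after the
    node at its lower end. The donor must not be the root.
    """
    nodes = set(parent) | set(parent.values())
    older_than = {node: set() for node in nodes}
    for child, par in parent.items():
        older_than[child].add(par)  # a parent is older than its child
    for donor, recipient in transfers:
        # The top of the donor branch predates the bottom of the recipient.
        older_than[recipient].add(parent[donor])
    return list(TopologicalSorter(older_than).static_order())


# Species tree ((A,B)x,(C,D)y)root with one transfer from branch A to branch y.
parent = {"A": "x", "B": "x", "C": "y", "D": "y", "x": "root", "y": "root"}
order = order_nodes(parent, [("A", "y")])
print(order)
```

Contradictory transfers make the constraints cyclic, in which case `TopologicalSorter` raises `CycleError`; a probabilistic method would instead weigh conflicting evidence.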

Bayesian species tree inference accounting for DTL events

• STRALE:
  • A Bayesian probabilistic method that can interpret thousands of gene trees in terms of:
    • speciation events
    • duplication events (D)
    • transfer events (T)
    • loss events (L)
  • A method able to estimate the DTL rates
  • A method able to reconstruct the species tree
  • A method able to order the nodes of the species tree

Simulation to test the species tree reconstruction

• 20 species
• 200 gene families

[Figure: simulated and inferred species trees shown side by side, with tips numbered 0 to 19]

Conclusion on DTL models

• The use of DTL models shows that the number of gene transfers has so far been overestimated

• DTL models can be used to study genome evolution and in particular rates of gene transfer

• DTL models can be used to date the nodes of a species phylogeny

• DTL models should provide a powerful tool to infer an accurate account of the history of life

Plan

1. Gene duplications and losses

• Mammalian genomes

2. Gene duplications, losses and transfers

• Fungi and Cyanobacteria

3. A fast approach to dealing with incomplete lineage sorting

• Birds

4. 2 vignettes


The multispecies coalescent

Rannala and Yang, Genetics 2003

• Divergence times in the species tree
• Divergence times in the gene trees
• Effective population sizes in the species tree
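To illustrate how these parameters interact, the sketch below uses the standard three-species result: for a species tree ((A,B),C) whose internal branch has length t in coalescent units, the gene tree matches the species tree with probability 1 - (2/3)exp(-t). The function names are hypothetical, and the conversion to coalescent units assumes a diploid population of constant effective size.

```python
import math


def coalescent_units(generations, effective_size):
    """Branch length in coalescent units, assuming a diploid population."""
    return generations / (2 * effective_size)


def gene_tree_probabilities(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    no_coalescence = math.exp(-t)  # lineages fail to coalesce in the branch
    return {
        "concordant": 1 - 2 * no_coalescence / 3,
        "each_discordant": no_coalescence / 3,
    }


# A short branch relative to population size leaves much discordance.
t = coalescent_units(generations=10_000, effective_size=50_000)
print(t, gene_tree_probabilities(t))
```

Short branches and large populations both shrink t, which is why incomplete lineage sorting matters most for rapid radiations.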

Faster alternatives to the multispecies coalescent use fixed gene trees

E.g.: MP-EST (Liu, Yu and Edwards, 2010)
Input: fixed gene trees
Output: species tree with branch lengths in coalescent units

Has been shown to be consistent, under one notable assumption: gene trees are correct.

Errors in gene trees decrease the accuracy of estimated species trees

Mirarab, Bayzid and Warnow, Syst. Biol. 2014

Statistical binning

[Figure: diagram labeled MP-EST]

Mirarab et al., Science 2014

Statistical binning improves species tree inference

Mirarab et al., Science 2014

Statistical binning also improves the estimation of the gene tree distribution

Mirarab et al., Science 2014

Statistical binning and birds

Jarvis et al., Science 2014

Improving statistical binning: weighted statistical binning

Practice: weighted binning and unweighted binning have about the same accuracy.

Theory: weighted statistical binning can be shown to be consistent; unweighted statistical binning is not.

Mirarab et al., PLoS One, accepted

Plan

1. Gene duplications and losses

• Mammalian genomes

2. Gene duplications, losses and transfers

• Fungi and Cyanobacteria

3. A fast approach to dealing with incomplete lineage sorting

• Birds

4. 2 vignettes

RevBayes

• R-like language

• Model-based phylogenetics

• Many models of sequence evolution

• Models for dating

• Models for phylogeography

• Models for continuous traits

• Models for gene tree/species tree inference

• http://revbayes.net

• Sebastian Hoehna • Michael Landis • Tracy Heath • Fredrik Ronquist • Nicolas Lartillot • Brian Moore • John Huelsenbeck • …

One more thing...

Conclusions

• We develop methods for gene tree and species tree inference

• Improvement of gene trees and species trees in the presence of:

• duplications and losses,

• transfers,

• incomplete lineage sorting

• Parallel algorithms applicable to genome-scale data

• We study the evolution of life, ancient and recent

RevBayes collaborators:

• Sebastian Hoehna • Michael Landis • Tracy Heath • Fredrik Ronquist • Brian Moore • John Huelsenbeck • …

Lyon collaborators:

• Adrián Arellano Davín

• Gergely Szöllősi (Budapest)

• Vincent Daubin

• Eric Tannier

• Thomas Bigot

• Magali Semeria

• Manolo Gouy

• Laurent Duret

• Nicolas Lartillot

Austin/Illinois collaborators:

• Siavash Mirarab

• Md. Shamsuzzoha Bayzid

• Tandy Warnow

Thanks!