

What happened to my genes? Insights on gene family dynamics from digital genetics experiments

C. Knibbe (1,2) and D. P. Parsons (1)

(1) INRIA Rhône-Alpes, Montbonnot F-38322, France
(2) Université de Lyon, Université Lyon 1, CNRS, UMR5205, LIRIS, Villeurbanne, F-69622, France
[email protected]

Abstract

Gene families are sets of homologous genes formed by duplications of a single original gene. Inferring their history in terms of gene duplications, gene losses and gene mutations yields fundamental insights into the molecular basis of evolution. However, phylogenetic inference of gene family evolution faces two difficulties: (i) the delimitation of gene families based on sequence similarity, and (ii) the fact that the models of evolution used for reconstruction are tested against simulated data that are produced by the model itself. Here, we show that digital genetics, or in silico experimental evolution, can provide thought-provoking synthetic gene family data, robust to rearrangements in gene sequences and, most importantly, not biased by where and how we think natural selection should act. Using aevol, a digital genetics model with an abstract phenotype but a realistic genome structure, we analyzed the evolution of 3,512 synthetic gene families under directional selection. The turnover of gene families in evolutionary runs was such that only 21% of those families would be accessible for classical phylogenetic inference. Extinct families showed patterns different from the final, observable ones, both in terms of dynamics of gene gains and losses and in terms of gene sequence evolution. This study also reveals that gene sequence evolution, and thus evolutionary innovation, occurred not only through local mutations, but also through chromosomal rearrangements that re-assembled parts of existing genes.

Introduction

How do new genes arise? Do they evolve mainly by local mutations or domain shuffling? Which events drive them to extinction? These questions are fundamental to understanding the evolutionary dynamics of living systems. Because the preservation of soft tissues is rare in fossil records, paleontology provides precious but limited knowledge of the past of the living world. The study of molecular evolution thus largely relies on the analysis of extant genes, which may or may not be a representative sample of genetic diversity throughout the evolution of life. Central to this analysis is the notion of gene family, defined as a set of homologous genes formed by duplications of a single original gene. Insights into the evolutionary dynamics at the molecular level are obtained by inferring the evolutionary history of gene gains and losses and of gene mutations in a gene family.

The usual strategy to identify gene families consists in detecting significant sequence similarities in gene or protein sequences. This method is inherently biased towards the detection of families that evolve mainly through local mutations rather than domain shuffling. As Song et al. (2008) make clear, “multidomain sequences, especially those with promiscuous domains that occur in many contexts, are frequently excluded from genomic analyses due to the lack of a theoretical framework and practical methods for detecting multidomain homologs”. Efforts are thus ongoing to develop multidomain homology identification methods (Geer et al., 2002; Enright et al., 2002; Lin et al., 2006; Song et al., 2008; Jachiet et al., 2013).

Once gene families have been identified, their evolutionary histories are inferred, using implicit or explicit models of evolution to describe the patterns of DNA base substitution and amino acid replacement (Liò and Goldman, 1998) and the patterns of gene gains and losses (Arvestad et al., 2004; Vilella et al., 2008; Akerborg et al., 2009; Rasmussen and Kellis, 2012; Boussau et al., 2013). These models of evolution make assumptions. For example, the model used by Vilella et al. (2008) assumes that gene duplications and deletions are rare events, and that duplication followed by complementary gene losses on the left and right branches of a duplication node is an unlikely scenario. Most models also assume that different gene families evolve independently, while a single duplication or deletion can actually span several genes. The scarcity of well-preserved ancient DNA samples makes it difficult to test these hypotheses directly. The common practice to test a phylogenetic method is thus to simulate artificial sequences to generate benchmarks. However, these artificial sequences are usually generated with the same general model of evolution as the one used by the phylogenetic method being tested, with only minor differences (see for example Rasmussen and Kellis (2012); Boussau et al. (2013)). There is thus a form of circularity in the overall process, which could leave some important aspects of evolutionary dynamics in the dark.

To provide better benchmarks for phylogenetic inference, some simulators like EvolSimulator (Beiko and Charlebois, 2007) and ALF (Dalquen et al., 2012) have been developed independently of a particular phylogenetic inference method. For example, ALF uses classical models of evolution at the gene sequence level, but allows for the duplication or loss of several consecutive genes at once. However, both ALF and EvolSimulator simulate only one sequence per species. The action of natural selection is incorporated in the mutational process, in the sense that only mutations assumed to be neutral or beneficial are simulated. For example, mutations that would lead to the formation of a stop codon (nonsense mutations) are not allowed in ALF (Dalquen et al., 2012). In EvolSimulator, specific probabilities of duplication and loss are pre-assigned to each gene (Beiko and Charlebois, 2007). Simulators of sequence evolution are also being developed in population genetics (reviewed by Hoban et al. (2012)), to predict the molecular polymorphism expected under various demographic scenarios. All individuals of the population are simulated but the genomic architecture is usually fixed, implying that gene gains or losses are not allowed. Deleterious events are allowed but the distribution of fitness effects is predefined at each locus.

ALIFE 14: Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems
DOI: http://dx.doi.org/10.7551/978-0-262-32621-6-ch006

Experimental evolution of microbes is a much more direct way to study evolution. Although the time scale of these laboratory experiments is short compared to those at stake in phylogenetic reconstruction, gene gains and losses do occur (Nilsson et al., 2005; Blount et al., 2012; Tenaillon et al., 2012; Maharjan et al., 2013; Payen et al., 2014) and frozen samples allow for a partial fossil record. In silico experimental evolution, or digital genetics (Adami, 2006), can bring a complementary perspective by providing fast synthetic genomic data, in which, in contrast to other types of simulators, the action of natural selection on genomic sequences is not predetermined by the user. In digital genetics, an abstract artificial chemistry is used to compute a phenotype from a genotype, and selection is based on the phenotype, not on the genotype. The Avida platform has already been used to test the effect of selection on phylogenetic reconstruction methods (Hagstrom et al., 2004; Hang et al., 2007). Here, we show how a digital genetics platform with a realistic genome structure can be used to directly study gene family evolution, with both local mutations and rearrangements. The exhaustive knowledge of all evolutionary events allows for the identification of gene families even when sequence similarity could be impaired by rearrangements that shuffled parts of coding sequences.

After a presentation of the model, called aevol (http://www.aevol.fr), we study the evolution of ten independent populations under directional selection, yielding 3,512 synthetic gene families. Using a new postprocessing tool designed for this purpose, we analyze the dynamics of genes within the context of those families: how and how often new genes are created, how and how often they are lost, how many and what types of mutational events occur in their sequences. By going beyond the simple time series of gene number, we show that our usual interpretation of gene number evolution in aevol was partly wrong. Not all new genes arise by duplication-divergence. Many are also created from previously non-coding sequences, after either a local mutation or a rearrangement. We also show that there is a high turnover of gene families, many of them lasting only a few hundred or a few thousand generations. This implies that final extant genes give only a partial insight into the dynamics of gene family evolution. Moreover, our analysis reveals that rearrangements do not restrict themselves to changing gene number and gene order. They also play a significant role in gene sequence evolution, and thus in evolutionary innovation, by rearranging parts of existing genes.

Aevol: A digital genetics model

Aevol is a digital genetics model that simulates the evolution of a population of N haploid organisms through a process of variation and selection. It was designed to study the evolution of genome structure (Knibbe et al., 2007; Beslon et al., 2010; Parsons et al., 2010; Frenoy et al., 2013; Batut et al., 2013). Thus, the design of the model focuses on the realism of the genome level and of the mutational process, while the selection process simply relies on a one-dimensional curve-fitting task.

Genome representation

Each artificial organism owns a chromosome whose structure is inspired by prokaryotic genomes. It is organized as a circular double-strand binary string containing a variable number of genes separated by non-coding sequences (Figure 1). Genes are delimited by predefined signaling sequences indicating transcription and translation start and stop. Transcription initiates at promoters, defined in the model as sequences that differ from an (arbitrarily chosen) 22-bp consensus sequence by $d \le 4$ mismatches. When a promoter is found, transcription proceeds until a terminator is reached. Terminators are defined as sequences that would be able to form a stem-loop structure, as the ρ-independent bacterial terminators do. In the following experiments, terminators had the structure $abcd{*}{*}{*}\bar{d}\bar{c}\bar{b}\bar{a}$, where $\bar{a} = 0$ if $a = 1$, and conversely. The expression level $e$ of an mRNA is determined according to the similarity of its promoter to the consensus: $e = 1 - \frac{d}{5}$.

Transcribed sequences (mRNAs) do not necessarily contain coding sequences. The translation initiation signal is the motif 011011****000 (Shine-Dalgarno-like sequence followed, a few base-pairs away, by a START codon). When this signal is found on an mRNA, the downstream sequence is read three bases (one codon) at a time until the termination signal, the STOP codon 001, is found in the same reading frame. Each codon lying between the initiation and termination signals is translated into an abstract “amino-acid” using an artificial genetic code (Figure 1).
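The promoter rule and the expression level can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not aevol's implementation: the consensus string is a made-up placeholder (the text only says it is an arbitrarily chosen 22-bp sequence), only one strand is scanned, and the function names are hypothetical.

```python
# Hypothetical 22-bp consensus; the actual sequence is not given in the text.
CONSENSUS = "0101011001110010010110"


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(genome: str, consensus: str = CONSENSUS, d_max: int = 4):
    """Yield (position, expression_level) for each promoter on one strand.

    The genome is treated as circular, so windows may wrap past the end.
    A window is a promoter if it has at most d_max mismatches with the
    consensus; its expression level is e = 1 - d/5, as in the text.
    """
    k = len(consensus)
    wrapped = genome + genome[: k - 1]  # emulate circularity
    for pos in range(len(genome)):
        d = hamming(wrapped[pos : pos + k], consensus)
        if d <= d_max:
            yield pos, 1.0 - d / 5.0
```

A perfect match gives e = 1, while a promoter with four mismatches is still transcribed, but at e = 0.2.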


[Figure 1. Panel labels: Chromosome, Promoter, Shine-Dalgarno, START, Coding sequence, STOP, Terminator, Transcription (the expression level e of the mRNA depends on the promoter sequence), Translation, Phenotype (axes: phenotypic trait, contribution level). Genetic code: 000 START, 001 STOP, 100 M0, 101 M1, 010 W0, 011 W1, 110 H0, 111 H1. Worked example: the “amino acid” sequence W0 - M1 - H1 - W1 - M1 - H0 gives the Gray codes m = 11, w = 01, h = 10, i.e. the binary values m = 10, w = 01, h = 11, hence m = 0.667, w = 0.333 w_max, h = 1.0.]

Figure 1: In the model, each organism owns a circular double-strand binary chromosome (a) along which genes are delimited by predefined signal sequences (b). Promoters and terminators mark the boundaries of RNAs (c) within which coding sequences are in turn identified between a Shine-Dalgarno-START signal and an in-frame STOP codon. Each coding sequence is then translated into a protein sequence using a predefined genetic code (d). This protein sequence is decoded as three real parameters called m, w and h (e). Proteins, phenotypes and environments are represented similarly through mathematical functions that associate a level to each abstract phenotypic trait in [0, 1]. The contribution of a protein is a piecewise-linear function with a triangular shape, with position m, half-width w and height h (f). All proteins encoded in the chromosome are then combined to compute the phenotype (g), which is compared to the environmental target to compute the fitness of the individual.

Protein function and phenotype computation

In the model, we assume that there is an abstract, continuous one-dimensional space Ω = [0, 1] of phenotypic traits. Each protein contributes positively or negatively to a subset of phenotypic traits, and is modeled as a mathematical function that associates a contribution level between -1.0 and 1.0 to each phenotypic trait. For simplicity, we use piecewise-linear functions with a symmetric, triangular shape (Figure 1). In this way, only three numbers are needed to characterize the contribution of a protein: the position m (m ∈ Ω) of the triangle on the axis, its half-width w and its height h (positive or negative). The protein thus contributes to the phenotypic traits in [m − w, m + w], with a maximal contribution for the traits closest to m. Thus, various types of proteins can co-exist, from highly efficient and highly specialized ones (low w, high h) to polyvalent but poorly efficient ones (high w, low h).

In this framework, the sequence of each protein is decomposed into three interlaced binary subsequences that will in turn be decoded as the values for the m, w and h parameters. For instance, the codon 010 (resp. 011) is translated into the single amino acid W0 (resp. W1), which means that it adds a bit 0 (resp. 1) to the Gray code of w. (The Gray code is a variant of the traditional binary code. It is widely used in evolutionary computation because it avoids the so-called Hamming cliffs: in the Gray code representation, consecutive integers are assigned bit strings that differ by only one bit.) Small mutations in the coding sequence (point mutations, indels, possibly causing frame shifts) can change these parameters and hence change the contribution of the protein to the phenotypic traits.
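The decoding step can be illustrated with the example protein shown in Figure 1. The sketch below is a hedged reconstruction: the genetic code is the one listed in the figure, but the normalization by 2^n − 1 is inferred from the figure's numbers (binary 10 gives 0.667), and the function names are hypothetical. Scaling the normalized values to the actual ranges of m, w and h is left out.

```python
# Genetic code as listed in Figure 1: codon -> (parameter, bit).
GENETIC_CODE = {
    "100": ("M", 0), "101": ("M", 1),
    "010": ("W", 0), "011": ("W", 1),
    "110": ("H", 0), "111": ("H", 1),
}


def gray_to_int(bits):
    """Convert a Gray-coded bit list (most significant bit first) to an int."""
    value = 0
    acc = 0
    for g in bits:
        acc ^= g                      # running XOR gives the next binary bit
        value = (value << 1) | acc
    return value


def decode_protein(codons):
    """Return normalized (m, w, h) values in [0, 1] for a list of codons.

    Each codon appends one bit to the Gray code of m, w or h. A parameter
    with no bits defaults to 0.0 (an assumption of this sketch).
    """
    gray = {"M": [], "W": [], "H": []}
    for codon in codons:
        param, bit = GENETIC_CODE[codon]
        gray[param].append(bit)
    decoded = {}
    for param, bits in gray.items():
        if bits:
            decoded[param] = gray_to_int(bits) / (2 ** len(bits) - 1)
        else:
            decoded[param] = 0.0
    return decoded["M"], decoded["W"], decoded["H"]
```

For the Figure 1 protein W0 - M1 - H1 - W1 - M1 - H0, i.e. the codons 010, 101, 111, 011, 101, 110, this returns approximately (0.667, 0.333, 1.0), matching the values in the figure.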

Once all the proteins encoded on the genotype of the organism have been identified, their contributions are combined to get the final level for each phenotypic trait. This is done by summing the mathematical functions of all proteins and keeping the result bounded between 0 and 1.0. The resulting piecewise-linear function $f_P : \Omega \to [0, 1]$ is called the phenotype of the organism. It indicates the level of each phenotypic trait in Ω.
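As a rough sketch of this combination step, the phenotype at a trait x can be computed by summing triangular contributions and clipping the sum. The names below are hypothetical, and the sketch assumes that each height h already includes any scaling by the expression level of the gene.

```python
def triangle(x, m, w, h):
    """Contribution at trait x of a protein with position m, half-width w, height h."""
    if w <= 0:
        return 0.0
    return h * max(0.0, 1.0 - abs(x - m) / w)


def phenotype(x, proteins):
    """Phenotype level at trait x: sum of all contributions, bounded to [0, 1].

    `proteins` is an iterable of (m, w, h) tuples; h may be negative.
    """
    total = sum(triangle(x, m, w, h) for m, w, h in proteins)
    return min(1.0, max(0.0, total))
```

Two overlapping proteins of height 0.8 at the same position therefore saturate at 1.0, and a lone inhibiting protein (negative h) yields 0 rather than a negative level.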

Environment, adaptation and selection

In the model, fitness depends on the difference between the levels of the phenotypic traits and target levels defined by a mathematical function $f_T : \Omega \to [0, 1]$. This target function indicates the optimal level of each phenotypic trait in Ω and is called the environmental target, or target for short. Here, $f_T$ was made up of three Gaussian lobes with standard deviation 0.05 and maximal height 0.5, centered on x = 0.2, 0.6 and 0.8 respectively. It was kept constant over evolutionary time. Adaptation was specifically measured by the gap $g = \int_{\Omega} |f_T(x) - f_P(x)| \, dx$ between $f_P$ and $f_T$. The lower the gap, the fitter the individual. This measure penalizes both the under-realization and the over-realization of each phenotypic trait.
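The gap can be approximated numerically. The sketch below assumes that the three lobes are simply summed and that 0.5 is the peak height of each lobe. It uses a midpoint rule on a regular grid, which is a choice of this example rather than a description of how aevol integrates its piecewise-linear functions.

```python
import math


def target(x, centers=(0.2, 0.6, 0.8), sigma=0.05, height=0.5):
    """Environmental target f_T: sum of three Gaussian lobes (assumed additive)."""
    return sum(height * math.exp(-0.5 * ((x - c) / sigma) ** 2) for c in centers)


def gap(f_p, f_t=target, n=10_000):
    """Midpoint-rule approximation of g = integral over [0, 1] of |f_T - f_P|."""
    dx = 1.0 / n
    return dx * sum(
        abs(f_t((i + 0.5) * dx) - f_p((i + 0.5) * dx)) for i in range(n)
    )
```

Under these assumptions, an organism with no genes (a phenotype that is zero everywhere) has a gap equal to the area under the target, roughly 0.188, and a phenotype identical to the target has a gap of 0.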

In the current version of Aevol, the population size is constant (here N = 1,000 individuals) and the population is entirely renewed at each generation. A probability of reproduction is assigned to each individual according to its gap, and a multinomial drawing determines the actual number of offspring each individual will have. Here, we used the so-called “fitness-proportionate” selection scheme, where the probability of reproduction of an individual with gap $g$ was $\frac{e^{-kg}}{\sum_{i=1}^{N} e^{-k g_i}}$. Here the environment was considered perfectly mixed, but a spatial grid structure where an individual competes only with its neighbors can also be used.
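A minimal sketch of this selection step follows. The value of k is not stated in this section, so it is left as a parameter and any value used with it is an arbitrary placeholder. The function names are hypothetical, and subtracting the smallest gap before exponentiating is a numerical-stability choice of this sketch that leaves the probabilities unchanged.

```python
import math
import random


def reproduction_probabilities(gaps, k):
    """Probability of reproduction exp(-k*g) / sum_i exp(-k*g_i) for each gap."""
    g_min = min(gaps)  # shifting by g_min avoids underflow for large k*g
    weights = [math.exp(-k * (g - g_min)) for g in gaps]
    total = sum(weights)
    return [w / total for w in weights]


def offspring_counts(gaps, k, rng):
    """Multinomial draw of N offspring among N parents (constant population size)."""
    n = len(gaps)
    probs = reproduction_probabilities(gaps, k)
    counts = [0] * n
    for parent in rng.choices(range(n), weights=probs, k=n):
        counts[parent] += 1
    return counts
```

For instance, `offspring_counts(gaps, k=100.0, rng=random.Random(0))` returns one count per parent, and the counts always sum to the population size.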

Mutations and rearrangements

During their replication, genomes can undergo local mutations (point mutations and small insertions or deletions of 1 to 6 bp) and chromosomal rearrangements (duplications, deletions, translocations and inversions). The breakpoints for these rearrangements are randomly chosen on the chromosome. A translocation is here defined as moving a segment to another position on the chromosome. Other versions of the platform exist that allow for plasmids, in which case translocations can move subsequences from the chromosome to the plasmids and conversely. This feature was not used here. Similarly, although lateral transfer is possible in aevol, we did not use it in these experiments, to keep the setup simple in this first study of gene family evolution. The rates at which the different types of genetic modification occur are defined per base, per replication.
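The four rearrangement types can be illustrated on a binary string. This is a simplified sketch with hypothetical function names: segments are given as half-open index ranges that do not wrap around the origin of the circular chromosome, and an inversion is assumed to replace the segment by its reverse complement, as expected for a double-strand sequence.

```python
COMPLEMENT = str.maketrans("01", "10")


def duplication(genome, i, j, insert_at):
    """Copy segment [i, j) and insert the copy at position insert_at."""
    return genome[:insert_at] + genome[i:j] + genome[insert_at:]


def deletion(genome, i, j):
    """Remove segment [i, j)."""
    return genome[:i] + genome[j:]


def translocation(genome, i, j, insert_at):
    """Cut segment [i, j) and reinsert it at insert_at in the remaining sequence."""
    segment = genome[i:j]
    rest = genome[:i] + genome[j:]
    return rest[:insert_at] + segment + rest[insert_at:]


def inversion(genome, i, j):
    """Replace segment [i, j) by its reverse complement (assumed semantics)."""
    segment = genome[i:j].translate(COMPLEMENT)[::-1]
    return genome[:i] + segment + genome[j:]
```

Because the breakpoints i, j and insert_at are arbitrary positions, nothing prevents them from falling inside a coding sequence, which is how a rearrangement can alter a gene sequence rather than just gene number or order.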

Workflow with the software suite

The typical workflow with the aevol software suite starts with the preparation of the initial population following the initialization method chosen by the user (with a random genome or with a mix of already evolved ones, for a competition assay for example). The second step is the evolutionary run itself. Depending on the parameters, a run of 100,000 generations may take from several hours up to several days. The run time depends on the population size, but also on the spontaneous rates of rearrangements, since they influence the evolved genome size (Knibbe et al., 2007). If ancestry relationships and mutations have been recorded during the run, and if individuals were asexual, it is possible, as a third step, to extract the line of descent of the best final individual from the recorded data and to replay the evolutionary events that occurred along this successful lineage. Except for the very last mutations that were possibly still segregating in the population, the replayed mutations are those that were fixed.

Results

We let 10 populations of 1,000 haploid asexual individuals evolve independently under directional selection for 100,000 generations. The spontaneous rate of each type of mutational event was set to $10^{-5}$ per bp. At the beginning of a run, all organisms of the population were initialized with the same random sequence of 5,000 bp containing at least one gene. This initial sequence was different for each population. In practice, populations started with either one or two genes. With this setup, we do not aim at mimicking the origin of life but rather the adaptation to a novel niche. Indeed, in the model, an individual without any gene on its chromosome can still replicate itself and express its genes: “core genes” for replication, transcription and translation are assumed to be implicitly present in each individual and their evolution is not modeled. What is actually simulated is the evolution of the non-essential subset of the genome, when the population faces a new environment.

As shown by Figure 2, genome evolution on the successful lineages starts with a phase of expansion, where new genes are massively acquired, along with much non-coding DNA. This excess DNA is then progressively removed from the genome, while gene acquisition slows down. This pattern was already observed in Knibbe et al. (2007), in a similar setup. Here, we went deeper into the analysis of gene repertoire dynamics by tracking the fate of each gene, as well as the paralogy relationships between genes.

[Figure 2. Two panels plotted against generations (0 to 100,000): number of coding sequences (axis from 0 to 150) and non-essential DNA in bp (logarithmic axis from 10^3 to 10^6).]

Figure 2: Evolution of genome size on the line of descent of the final best individuals. The shaded area indicates the standard deviation across repetitions. Non-essential DNA is defined as DNA that can be removed without changing the phenotype. It includes intergenic DNA, but also the transcribed but untranslated regions (UTRs).

For each repetition, evolutionary events on the line of descent of the best final individual were replayed. Each gene in the initial genome was tagged and considered the root of a gene family, which was stored as a binary tree. When replaying the mutational events, the fate of each gene was followed and recorded mutation after mutation. We considered in this analysis that a gene was composed of its coding sequence and of its “upstream region”, defined as the sequence located between the first base pair of the first promoter of the coding sequence and the first base pair of the Shine-Dalgarno-start signal for translation initiation.
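One way to picture such a gene family is as a binary tree in which a duplication turns a leaf into an internal node with two children, and a loss marks a leaf as extinct. The class below is a hypothetical sketch of that idea, not the data structure of the actual postprocessing tool. Gene identifiers and field names are made up for illustration.

```python
class GeneNode:
    """One branch of a gene family tree, from its birth to a duplication or loss."""

    def __init__(self, gene_id, born):
        self.gene_id = gene_id
        self.born = born        # generation at which this branch starts
        self.ended = None       # generation of duplication or loss, if any
        self.lost = False
        self.children = []      # empty, or exactly two nodes after a duplication
        self.events = []        # (generation, description) of non-lethal events

    def duplicate(self, t, id_a, id_b):
        """End this branch at generation t and start two daughter branches."""
        self.ended = t
        self.children = [GeneNode(id_a, t), GeneNode(id_b, t)]
        return self.children

    def lose(self, t):
        """Mark this gene as lost at generation t."""
        self.ended = t
        self.lost = True

    def branch_length(self, now):
        """Branch length in generations, up to `now` if the branch is still open."""
        end = self.ended if self.ended is not None else now
        return end - self.born


def extant_leaves(node):
    """Family size: number of leaves that have not been lost."""
    if node.children:
        return sum(extant_leaves(child) for child in node.children)
    return 0 if node.lost else 1
```

A family is extinct when `extant_leaves(root)` returns 0, and the branch lengths give the denominators needed to express event counts as rates per gene per generation.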

Figure 3 shows an example of a small gene family. The topology of the tree indicates the dynamics of gene duplications and losses. Because the exact timing of events is known, the branch lengths represent the real time elapsed, in number of generations. The branches are annotated with the mutational events that affected either the coding sequences or their upstream regions. These events can either be local mutations or chromosomal rearrangements. Indeed, aevol, along with the “ARN” model (Banzhaf, 2003) and potentially the model of early metabolism by Ullrich et al. (2011), is one of the few digital genetics models in which the breakpoints of chromosomal rearrangements are not constrained to intergenic regions. They operate at the sequence level rather than at the gene level and are thus blind to the genic/intergenic status of the sequences they disrupt. As a consequence, they can modify gene content and gene order, but also generate variability in gene sequences.

Number of gene families

In previous studies with aevol, which mostly relied on the time series of gene number as in Figure 2, we thought that most evolved genes ultimately descended from the initial one(s). Indeed, because simulations started with only one or two genes and because several initiation signals are necessary for a sequence to be coding, one would expect that most genes would be created by a duplication-divergence process and that each run would thus contain one or two gene families only. There were actually on average 351.2 ± 158.3 gene families per evolutionary run, contrary to what we expected. This is not an artifact due to a saturation of the phylogenetic signal, because gene family identification is here based on the exact knowledge of all events and not on sequence similarity, and is thus insensitive to such a saturation. Thus, de novo gene creation was not rare. On average, in a run, the gene families of the initialization represented only 0.4% of all families and 10.8% of the final genes. This fits with a recent analysis of proto-gene candidates in the genome of the yeast S. cerevisiae, which suggested that “de novo gene birth may be more prevalent than sporadic gene duplication” (Carvunis et al., 2012). In our synthetic dataset, 55% of de novo gene creations were due to a local mutation and around 44% were due to a chromosomal rearrangement.

The large variation across runs in the number of families stems from a bimodal distribution, with seven runs centered around 281 families (hereafter called group A) while three other runs (called group B) are centered around 632 families. This variation is due to the initial phase of genome expansion, since on average 61% of the gene families in these three runs were extinct before generation 10,000. At the end of the runs (t = 100,000 generations), on average 73.4 ± 6.2 gene families were still active in each run, meaning that they had at least one remaining gene. Thus, the families that would be accessible for classical phylogenetic inference would represent only 21% of the gene families that played a role in the evolutionary history of the evolved populations.

Size of gene families

What is usually called the family size is the number of non-extinct leaves at the time of observation. Here, the mean family size at t = 100,000 was 1.36 ± 0.1 non-extinct leaves, meaning that each family had on average 1.36 sibling genes (paralogs) in the evolved genome. In real genomes, gene family size is known to follow a power-law distribution (Huynen and van Nimwegen, 1998), with a vast majority of very small gene families and a few very large families. As shown by Figure 4, the family sizes obtained here do not span enough orders of magnitude to conclude that they follow a power-law distribution. However, in all runs, the vast majority of families had size 1, while only one or two families had a size larger than 4.

[Figure 4. Log-log plot. Horizontal axis: family size at t = 100,000, in bins 1, 2..3, 4..7, 8..15, 16..31. Vertical axis: number of families (from 1 to 100).]

Figure 4: Distribution of gene family size at t = 100,000. The different symbols correspond to the different repetitions and the black curve is the mean frequency over the ten repetitions. Following Huynen and van Nimwegen (1998), family sizes were binned exponentially and both axes are logarithmic. Missing symbols correspond to a frequency of 0.
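The exponential binning used for this figure groups sizes into the ranges 1, 2..3, 4..7, 8..15, 16..31, that is, bin number k covers sizes from 2^k to 2^(k+1) − 1. A small sketch with hypothetical function names shows how family sizes could be assigned to these bins. Whether the plotted frequencies were further normalized by bin width is not stated, so no normalization is applied here.

```python
from collections import Counter


def exponential_bin(size):
    """Return the (low, high) bounds of the bin [2^k, 2^(k+1) - 1] containing size."""
    if size < 1:
        raise ValueError("family size must be a positive integer")
    k = size.bit_length() - 1
    return 2 ** k, 2 ** (k + 1) - 1


def binned_histogram(sizes):
    """Count how many families fall in each exponential bin."""
    return Counter(exponential_bin(s) for s in sizes)
```

For example, sizes 1, 1, 1, 2, 3 and 5 give three families in bin (1, 1), two in bin (2, 3) and one in bin (4, 7).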

Rates of gene duplication and lossWhen taking all families of a run into account, a gene gainby duplication occurred on average every 212 generationsin group A, and every 5.5 generations in group B. A geneloss occurred every 158 generations on average in group A,and every 5.4 generations in group B (63% of gene losseshappened by the complete deletion of the gene, 16% weredue to another chromosomal rearrangement and 21% weredue to a local mutation). However, these rates of gene

ALIFE 14: Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems

Page 6: What happened to my genes? Insights on gene family dynamics

[Figure 3 tree: genes #130, #163, #164, #340, #432, #433, #781 and #782 on a time axis (scale bar: 200 generations), from the de novo creation of the family by a rearrangement at t = 214, through duplications at t = 240, 247, 297, 379, 407 and 1,425, to the last gene loss at t = 1,904 after a translocation affecting the coding sequence.]

Figure 3: Example of a small gene family. This is the fourth gene family of the third run. It was born at t = 214 and went extinct at t = 1,904. Squares indicate duplications, crosses indicate deletions, stars indicate inversions or translocations, and circles indicate point mutations, small insertions or small deletions. Gray events were neutral, while red events were deleterious and green events were beneficial. The topology of the tree was drawn with NJPlot (Perriere and Gouy, 1996).

gains and losses are somewhat misleading because 90% of the gene duplications and gene losses occurred before generation 5,500. Hence, during the initial phase of genome expansion, the gene content is extremely dynamic. Besides, duplications and deletions generally encompass several neighboring genes, thereby creating strong correlations across families and inside different subtrees of a family. For example, in the first run, there were 604 gene duplications but they were concentrated in only 85 different generations.
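To see why a mean interval between events can mislead, one can compare it with two other summaries of the same event list: the generation by which 90% of the events have occurred, and the number of distinct generations in which events happened. The sketch below assumes a hypothetical input format (one generation number per duplicated gene) and uses invented data.

```python
def event_summary(event_generations: list[int], total_generations: int) -> dict[str, float]:
    """Summarize how events (e.g. gene duplications) are spread over a run.

    event_generations holds one entry per event, so a generation in which a
    single rearrangement duplicated five genes appears five times.
    """
    if not event_generations:
        raise ValueError("no events to summarize")
    ordered = sorted(event_generations)
    n = len(ordered)
    # Index of the event that completes 90% of all events (integer ceiling).
    index_90 = (9 * n + 9) // 10 - 1
    return {
        "mean_interval": total_generations / n,
        "generation_of_90_percent": ordered[index_90],
        "distinct_generations": len(set(ordered)),
    }


# Hypothetical run: bursts of duplications early on, then a single late event.
events = [120] * 4 + [300] * 3 + [2500] * 2 + [80000]
summary = event_summary(events, total_generations=100_000)
print(summary)
```

In this toy run the mean interval suggests one duplication every 10,000 generations, although nine of the ten events fall within the first 2,500 generations and in only three distinct generations.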

Evolutionary rates of gene sequences

On each branch of each gene family, we counted and classified the events that modified the gene sequence without killing it, and divided these counts by the branch length in number of generations. Those rates per gene per generation were averaged over all branches of all gene families of all runs to produce Figure 5A. It shows that the coding sequences underwent more changes than the upstream region. This is an expected result given that (in the model) changes in the coding sequences can change the phenotypic traits to which the gene contributes, whereas changes in the promoter can only modulate the level of a gene's contribution. Both local mutations and rearrangements modified gene sequences, but rearrangements were more numerous than local mutations in branches shorter than 290 generations, which represent 90% of the dataset. Thus, the mean rate of rearrangements over all branches turns out to be higher than the mean rate of local mutations. Beneficial mutations were also more frequent than neutral events, which is expected under a directional selection setting, but also depends on the fact that the artificial genetic code is not redundant. Neutral mutations can happen between the promoter and the start signal, but it is a rather small mutational target. Neutral mutations can also happen in intergenic regions, but those were not monitored here.
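The rate computation described above (event counts divided by branch length, then averaged over branches) can be sketched as follows. The `Branch` structure and the event-type names are hypothetical, and skipping zero-length branches is an assumption made here, not something the paper states.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Branch:
    """One branch of a gene family tree (hypothetical structure)."""
    length: int  # branch length in generations
    # Counts of non-lethal events on this branch, keyed by event type.
    event_counts: dict[str, int] = field(default_factory=dict)


def mean_rates(branches: list[Branch]) -> dict[str, float]:
    """Average per-generation rate of each event type over all branches."""
    usable = [b for b in branches if b.length > 0]  # assumption: skip empty branches
    if not usable:
        raise ValueError("no branch with a positive length")
    totals: dict[str, float] = defaultdict(float)
    for branch in usable:
        for event_type, count in branch.event_counts.items():
            totals[event_type] += count / branch.length
    # Branches without a given event type contribute a rate of 0 to the mean.
    return {event_type: total / len(usable) for event_type, total in totals.items()}


# Hypothetical branches: a short one with rearrangements, a long one without.
branches = [
    Branch(100, {"rearrangement": 2, "local_mutation": 1}),
    Branch(1000, {"local_mutation": 2}),
]
rates = mean_rates(branches)
print(rates)
```

Because this is a mean of per-branch ratios, short branches weigh as much as long ones, which is consistent with the observation that rearrangements dominate the mean rate when they dominate on the many short branches.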

Figure 5B shows the overall normalized variation of each rate across all branches of all gene families of all runs. The indicator with the lowest normalized variation is the rate of all events that affect the gene sequence, regardless of whether they are neutral or beneficial, local mutations or rearrangements. It cannot, however, be chosen as a molecular clock, because it includes non-neutral events, whose count would be affected by the strength and type of selection. A good candidate for a molecular clock should thus both minimize its variation across branches and trees, and count neutral events only, in order to be robust to the selection regime. According to this criterion, the rate of all neutral events affecting the gene sequence, including rearrangements, would make a slightly better molecular clock than the rate of neutral local mutations only (Figure 5B, orange bars).
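The selection criterion for a molecular clock (lowest dispersion among neutral indicators only) can be illustrated with a short sketch. The dispersion measure is the one given in the Figure 5 caption, 100 × standard deviation / mean; the use of the population standard deviation, the indicator names and the per-branch rates below are all assumptions for illustration.

```python
from statistics import mean, pstdev


def relative_std(rates: list[float]) -> float:
    """Relative standard deviation in percent: 100 * standard deviation / mean."""
    m = mean(rates)
    if m == 0:
        raise ValueError("mean rate is zero; dispersion is undefined")
    return 100 * pstdev(rates) / m


def best_clock(indicators: dict[str, list[float]], neutral: set[str]) -> str:
    """Pick the neutral indicator with the lowest relative standard deviation."""
    candidates = {
        name: relative_std(values)
        for name, values in indicators.items()
        if name in neutral
    }
    return min(candidates, key=candidates.get)


# Hypothetical per-branch rates for three indicators (not data from the paper).
indicators = {
    "all_events": [0.010, 0.012, 0.011],
    "neutral_events": [0.002, 0.004, 0.003],
    "neutral_local_mutations": [0.001, 0.004, 0.001],
}
neutral = {"neutral_events", "neutral_local_mutations"}
print(best_clock(indicators, neutral))
```

In this toy example "all_events" has the lowest dispersion overall but is excluded because it counts non-neutral events, so the choice falls on the neutral indicator that includes rearrangements.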

Analyses of gene families in real genomes of fungi, insects, and mammals have revealed a negative correlation between the age of the family and the evolutionary rate of its members (Capra et al., 2013). As shown by Figure 5C, such a negative correlation is clear in our synthetic data if all gene


[Figure 5 plots. Panels A and B: horizontal bars for each indicator (all events; neutral or beneficial events; events in the coding sequence or in the upstream region; local mutations or rearrangements; and their combinations), with x-axes "Mean rate per generation" (A) and "Normalized dispersion" (B). Panel C: rate of sequence evolution, including rearrangements (mean over all branches of the family), against the time span of the gene family, both axes logarithmic; r = -0.78, p-value < 2 × 10^-16 and r = +0.23, p-value = 3 × 10^-10.]

Figure 5: A. Mean evolutionary rate, per gene per generation, for each type of event (average over all branches of all gene families of all runs). B. Relative standard deviation (100 × standard deviation / mean) of each indicator across all branches of all gene families of all runs. Orange bars correspond to neutral indicators, candidates for a molecular clock. C. Correlation between the logarithm of family time span and the logarithm of the mean evolutionary rate in the family (black = all gene families, green = active families at t = 100,000). The mean evolutionary rate of a family was computed as the average, over all its branches, of the per-generation rate of events that modify the gene without killing it. This is the indicator called "All events" in panels A and B. It includes both local mutations and chromosomal rearrangements.

families are considered (r = −0.78, p-value < 2 × 10^-16). However, if one restricts the analysis to the observable families in the final evolved genomes, the correlation becomes positive but weaker (r = +0.23, p-value ∼ 3 × 10^-10). In the final genomes accessible to phylogenetic analyses, no trace remains of the fast-evolving gene families that supported the initial (yet important) steps of the adaptation to the novel environment. Note, however, that even these final observable families evolve relatively fast and that we are actually simulating only the subset of nonessential genes that confer a selective advantage in the novel environment. In contrast, the real datasets reviewed in Capra et al. (2013) would include the core genes for, e.g., replication and gene expression. This could explain the differences in the correlation patterns between the synthetic and the real data.
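The contrast between the two correlations can be illustrated with a small sketch that computes a Pearson coefficient on log-transformed values, first over all families and then over the surviving ones only. The data are invented to show how restricting to survivors can flip the sign; they are not taken from the paper, and families with a zero rate or zero time span are assumed to be excluded beforehand since they cannot be log-transformed.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def log_log_correlation(time_spans: list[float], rates: list[float]) -> float:
    """Pearson r between log10(time span) and log10(evolutionary rate)."""
    return pearson([math.log10(t) for t in time_spans],
                   [math.log10(r) for r in rates])


# Hypothetical families: (time span in generations, mean rate, active at the end).
families = [
    (10, 0.1, False), (100, 0.05, False), (1000, 0.01, False),
    (50000, 0.0005, True), (90000, 0.001, True), (99000, 0.002, True),
]
r_all = log_log_correlation([f[0] for f in families], [f[1] for f in families])
survivors = [f for f in families if f[2]]
r_survivors = log_log_correlation([f[0] for f in survivors], [f[1] for f in survivors])
print(f"all families: r = {r_all:.2f}; survivors only: r = {r_survivors:.2f}")
```

The base of the logarithm does not affect r, and the sketch does not compute p-values.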

Conclusion

This study of synthetic gene families revealed that, upon adaptation to a new environment, (i) there was a high turnover of gene families, and extinct families showed patterns different from the final, observable ones, both in terms of dynamics of gene gains and losses and in terms of gene sequence evolution, (ii) gene sequence evolution occurred through both local mutations and chromosomal rearrangements, and (iii) incorporating chromosomal rearrangements in the evolutionary rate of gene sequences would slightly improve the accuracy of the molecular clock. Although some of the results can depend on the simplifications of the model (like the absence of redundancy of the artificial genetic code), the study is a demonstration of how digital genetics can explore data inaccessible to classical phylogenetic methods. With refined models developed in close collaboration with phylogeneticists, digital genetics could prompt a reassessment of the biases and limitations in the studies of evolutionary dynamics of genes.

Acknowledgements

This research program was supported by the EvoEvo FP7 European project, by the PEPII program of the French CNRS and by the Rhône-Alpes Institute for Complex Systems (IXXI). We thank Eric Tannier and Guillaume Beslon for the inspiring discussions and comments.

References

Adami, C. (2006). Digital genetics: unravelling the genetic basis of evolution. Nat. Rev. Genet., 7:109–118.

Akerborg, O., Sennblad, B., Arvestad, L., and Lagergren, J. (2009). Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci USA, 106(14):5714–5719.

Arvestad, L., Berglund, A.-C., Lagergren, J., and Sennblad, B. (2004). Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In Proc. RECOMB 2004, pages 326–335. ACM Press, New York.


Banzhaf, W. (2003). On the dynamics of an artificial regulatory network. In Banzhaf, W., Ziegler, J., Christaller, T., Dittrich, P., and Kim, J., editors, Advances in Artificial Life, volume 2801 of Lecture Notes in Computer Science, pages 217–227. Springer Berlin Heidelberg.

Batut, B., Parsons, D., Fischer, S., Beslon, G., and Knibbe, C. (2013). In silico experimental evolution: a tool to test evolutionary scenarios. BMC Bioinformatics, 14(Suppl 15):S11.

Beiko, R. G. and Charlebois, R. L. (2007). A simulation test bed for hypotheses of genome evolution. Bioinformatics, 23(7):825–831.

Beslon, G., Parsons, D. P., Sanchez-Dehesa, Y., Pena, J. M., and Knibbe, C. (2010). Scaling laws in bacterial genomes: A side-effect of selection of mutational robustness. BioSystems, 102(1):32–40.

Blount, Z. D., Barrick, J. E., Davidson, C. J., and Lenski, R. E. (2012). Genomic analysis of a key innovation in an experimental Escherichia coli population. Nature, 489(7417):513–518.

Boussau, B., Szollosi, G. J., Duret, L., Gouy, M., Tannier, E., and Daubin, V. (2013). Genome-scale coestimation of species and gene trees. Genome Research, 23(2):323–330.

Capra, J. A., Stolzer, M., Durand, D., and Pollard, K. S. (2013). How old is my gene? Trends in Genetics, 29(11):659–668.

Carvunis, A.-R., Rolland, T., Wapinski, I., Calderwood, M. A., Yildirim, M. A., Simonis, N., Charloteaux, B., Hidalgo, C. A., Barbette, J., Santhanam, B., Brar, G. A., Weissman, J. S., Regev, A., Thierry-Mieg, N., Cusick, M. E., and Vidal, M. (2012). Proto-genes and de novo gene birth. Nature, 487(7407):370–374.

Dalquen, D. A., Anisimova, M., Gonnet, G. H., and Dessimoz, C. (2012). ALF–A Simulation Framework for Genome Evolution. Molecular Biology and Evolution, 29(4):1115–1123.

Enright, A. J., Van Dongen, S., and Ouzounis, C. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575–1584.

Frenoy, A., Taddei, F., and Misevic, D. (2013). Genetic architecture promotes the evolution and maintenance of cooperation. PLoS Computational Biology, 9(11):e1003339.

Geer, L. Y., Domrachev, M., Lipman, D. J., and Bryant, S. H. (2002). CDART: protein homology by domain architecture. Genome Research, 12(10):1619–1623.

Hagstrom, G. I., Hang, D. H., Ofria, C., and Torng, E. (2004). Using Avida to test the effects of natural selection on phylogenetic reconstruction methods. Artificial Life, 10(2):157–166.

Hang, D., Torng, E., Ofria, C., and Schmidt, T. M. (2007). The effect of natural selection on the performance of maximum parsimony. BMC Evolutionary Biology, 7(1):94.

Hoban, S., Bertorelle, G., and Gaggiotti, O. E. (2012). Computer simulations: tools for population and evolutionary genetics. Nature Reviews Genetics, 13(2):110–122.

Huynen, M. A. and van Nimwegen, E. (1998). The frequency distribution of gene family sizes in complete genomes. Molecular Biology and Evolution, 15(5):583–589.

Jachiet, P. A., Pogorelcnik, R., Berry, A., Lopez, P., and Bapteste, E. (2013). MosaicFinder: identification of fused gene families in sequence similarity networks. Bioinformatics, 29(7):837–844.

Knibbe, C., Mazet, O., Chaudier, F., Fayard, J.-M., and Beslon, G. (2007). Evolutionary coupling between the deleteriousness of gene mutations and the amount of non-coding sequences. J. Theor. Biol., 244(4):621–630.

Lin, K., Zhu, L., and Zhang, D. Y. (2006). An initial strategy for comparing proteins at the domain architecture level. Bioinformatics, 22(17):2081–2086.

Lio, P. and Goldman, N. (1998). Models of molecular evolution and phylogeny. Genome Research, 8(12):1233–1244.

Maharjan, R. P., Gaffé, J., Plucain, J., Schliep, M., Wang, L., Feng, L., Tenaillon, O., Ferenci, T., and Schneider, D. (2013). A case of adaptation through a mutation in a tandem duplication during experimental evolution in Escherichia coli. BMC Genomics, 14(1):1–1.

Nilsson, A. I., Koskiniemi, S., Eriksson, S., Kugelberg, E., Hinton, J. C. D., and Andersson, D. I. (2005). Bacterial genome size reduction by experimental evolution. Proc Natl Acad Sci USA, 102(34):12112–12116.

Parsons, D. P., Knibbe, C., and Beslon, G. (2010). Importance of the rearrangement rates on the organization of transcription. In Proceedings of Artificial Life XII, pages 479–486.

Payen, C., Di Rienzi, S. C., Ong, G. T., Pogachar, J. L., Sanchez, J. C., Sunshine, A. B., Raghuraman, M. K., Brewer, B. J., and Dunham, M. J. (2014). The dynamics of diverse segmental amplifications in populations of Saccharomyces cerevisiae adapting to strong selection. G3: Genes|Genomes|Genetics, 4(3):399–409.

Perriere, G. and Gouy, M. (1996). WWW-query: An on-line retrieval system for biological sequence banks. Biochimie, 78(5):364–369.

Rasmussen, M. D. and Kellis, M. (2012). Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Research, 22(4):755–765.

Song, N., Joseph, J. M., Davis, G. B., and Durand, D. (2008). Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins. PLoS Computational Biology, 4(5):e1000063.

Tenaillon, O., Rodriguez-Verdugo, A., Gaut, R. L., McDonald, P., Bennett, A. F., Long, A. D., and Gaut, B. S. (2012). The Molecular Diversity of Adaptive Convergence. Science, 335(6067):457–461.

Ullrich, A., Rohrschneider, M., Scheuermann, G., Stadler, P. F., and Flamm, C. (2011). In silico evolution of early metabolism. Artificial Life, 17(2):87–108.

Vilella, A. J., Severin, J., Ureta-Vidal, A., Heng, L., Durbin, R., and Birney, E. (2008). EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Research, 19(2):327–335.
