orthologs, paralogs, and evolutionary...

32
Orthologs, Paralogs, and Evolutionary Genomics 1 Eugene V. Koonin National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894; email: [email protected] Annu. Rev. Genet. 2005. 39:309–38 First published online as a Review in Advance on August 30, 2005 The Annual Review of Genetics is online at http://genet.annualreviews.org doi: 10.1146/ annurev.genet.39.073003.114725 Copyright c 2005 by Annual Reviews. All rights reserved 1 The U.S. Government has the right to retain a nonexclusive, royalty-free license in and to any copyright covering this paper. 0066-4197/05/1215- 0309$20.00 Key Words homolog, ortholog, paralog, pseudoortholog, pseudoparalog, xenolog Abstract Orthologs and paralogs are two fundamentally different types of ho- mologous genes that evolved, respectively, by vertical descent from a single ancestral gene and by duplication. Orthology and paral- ogy are key concepts of evolutionary genomics. A clear distinction between orthologs and paralogs is critical for the construction of a robust evolutionary classification of genes and reliable functional an- notation of newly sequenced genomes. Genome comparisons show that orthologous relationships with genes from taxonomically dis- tant species can be established for the majority of the genes from each sequenced genome. This review examines in depth the defini- tions and subtypes of orthologs and paralogs, outlines the principal methodological approaches employed for identification of orthology and paralogy, and considers evolutionary and functional implications of these concepts. 309 Annu. Rev. Genet. 2005.39:309-338. Downloaded from arjournals.annualreviews.org by Stanford University Robert Crown Law Lib. on 05/01/07. For personal use only.

Upload: others

Post on 29-Jan-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Orthologs, Paralogs, andEvolutionary Genomics1

    Eugene V. KooninNational Center for Biotechnology Information, National Library of Medicine,National Institutes of Health, Bethesda, Maryland 20894;email: [email protected]

    Annu. Rev. Genet.2005. 39:309–38

    First published online as aReview in Advance onAugust 30, 2005

    The Annual Review ofGenetics is online athttp://genet.annualreviews.org

    doi: 10.1146/annurev.genet.39.073003.114725

    Copyright c© 2005 byAnnual Reviews. All rightsreserved

    1The U.S. Governmenthas the right to retain anonexclusive, royalty-freelicense in and to anycopyright covering thispaper.

    0066-4197/05/1215-0309$20.00

    Key Words

    homolog, ortholog, paralog, pseudoortholog, pseudoparalog,xenolog

    AbstractOrthologs and paralogs are two fundamentally different types of ho-mologous genes that evolved, respectively, by vertical descent froma single ancestral gene and by duplication. Orthology and paral-ogy are key concepts of evolutionary genomics. A clear distinctionbetween orthologs and paralogs is critical for the construction of arobust evolutionary classification of genes and reliable functional an-notation of newly sequenced genomes. Genome comparisons showthat orthologous relationships with genes from taxonomically dis-tant species can be established for the majority of the genes fromeach sequenced genome. This review examines in depth the defini-tions and subtypes of orthologs and paralogs, outlines the principalmethodological approaches employed for identification of orthologyand paralogy, and considers evolutionary and functional implicationsof these concepts.

    309

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Contents

    INTRODUCTION. . . . . . . . . . . . . . . . . 310HISTORY, DETAILED

    DEFINITIONS, ANDCLASSIFICATION OFORTHOLOGS ANDPARALOGS . . . . . . . . . . . . . . . . . . . . . 311A Super-Brief History of

    Homology . . . . . . . . . . . . . . . . . . . . 311Orthology and Paralogy:

    Definitions and Complications . 312IDENTIFICATION OF

    ORTHOLOGS ANDPARALOGS: PRINCIPLES ANDTECHNIQUES . . . . . . . . . . . . . . . . . 316

    EVOLUTIONARY PATTERNS OFORTHOLOGY ANDPARALOGY . . . . . . . . . . . . . . . . . . . . . 321Coverage of Genomes in Clusters

    of Orthologs . . . . . . . . . . . . . . . . . . 321One-to-One Orthologs and

    Inparalogs . . . . . . . . . . . . . . . . . . . . 322Orthologous Clusters and the

    Molecular Clock . . . . . . . . . . . . . . 323Xenologs, Pseudoorthologs, and

    Pseudoparalogs . . . . . . . . . . . . . . . 324Protein Domain Rearrangements,

    Gene Fusions/Fissions, andOrthology . . . . . . . . . . . . . . . . . . . . 327

    FUNCTIONAL CORRELATES OFORTHOLOGY ANDPARALOGY . . . . . . . . . . . . . . . . . . . . . 330

    GENERAL DISCUSSION. . . . . . . . . . 331Orthology and Paralogy as

    Evolutionary Inferences andthe Homology Debates . . . . . . . . 331

    Generalized Concepts ofOrthology and Paralogy . . . . . . . 332

    CONCLUSIONS. . . . . . . . . . . . . . . . . . . 333

    INTRODUCTION

    One of the most fascinating aspects of mod-ern genomics is the radical change it brings toevolutionary biology. The availability of mul-

    tiple, complete genomes of diverse life formsfor comparative analysis provides a qualita-tively new perspective on homologous rela-tionships between genes. By comparing thesequences of all genes between genomes fromdifferent taxa and within each genome, it is,in principle, possible to reconstruct the evo-lutionary history of each gene in its entirety(within the set of sequenced genomes). This,in turn, will allow a deeper understanding ofthe general trends in the evolution of genomiccomplexity and lineage-specific adaptations.Gene histories must be presented in the formof scenarios that comprise several types of el-ementary events (55, 64, 84). The elementaryevents of gene evolution can be classified asfollows, roughly in the order of relative con-tribution to the evolutionary process: (i) ver-tical descent (speciation) with modification;(ii) gene duplication, also followed by descentwith modification; (iii) gene loss; (iv) hori-zontal gene transfer (HGT); and (v) fusion,fission, and other rearrangements of genes.Vertical descent and duplication might beconsidered the primary events of genome evo-lution and have been well recognized in thepregenomic era. In contrast, the crucial evo-lutionary importance of gene loss, HGT, andgene rearrangements was among the major,fundamental generalizations of the emergingevolutionary genomics (13, 14, 16, 50, 51, 57,77, 78).

    Along with the notion of elementary evo-lutionary events, all descriptions of evolu-tion of genes, gene ensembles, and, ultimately,complete gene repertoires of organismsrest on certain key concepts of evolution-ary biology, primarily, the definitions of ho-mologs, orthologs, and paralogs. Developedlong ago by evolutionists, these related con-cepts and terms have reemerged and have be-come the subject of intense debate and nu-merous misunderstandings with the advent ofmolecular evolution and, subsequently, evo-lutionary genomics (24, 25, 40, 46, 76, 80,81, 97). The aversion of some biologists toideas and terms deriving from evolution-ary biology is reflected in the peculiar word

    310 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    “homologuephobia,” albeit used in a tongue-in-cheek manner (80).

    Homology, the most general definition,designates a relationship of common descentbetween any entities, without further speci-fication of the evolutionary scenario. Accord-ingly, the entities related by homology, in par-ticular, genes, are called homologs. The othertwo key terms define subcategories of ho-mologs. Orthologs are genes related via speci-ation (vertical descent), whereas paralogs aregenes related via duplication (23). The com-bination of speciation and duplication events,along with HGT, gene loss, and gene rear-rangements, entangle orthologs and paralogsinto complex webs of relationships. Correct,coherent usage of these terms would be of cer-tain importance if only to provide clarity tothe descriptions of genome evolution. How-ever, beyond semantics, these concepts havedistinct and important evolutionary and func-tional connotations.

    In this review, I discuss the intricacies ofthe definitions of orthologs and paralogs, in-cluding several derivative categories of ho-mologs and the respective terms, approachesused for identification of orthologs and par-alogs, and the functional implications oforthologous and paralogous relationships.

    HISTORY, DETAILEDDEFINITIONS, ANDCLASSIFICATION OFORTHOLOGS AND PARALOGS

    A Super-Brief History of Homology

    The term homolog was introduced by RichardOwen in 1843 to designate “the same organ indifferent animals under every variety of formand function.” Owen clearly distinguished ho-mologs from analogs, which he defined as a“part or organ in one animal which has thesame function as another part or organ in adifferent animal” (72). He attributed homolo-gies to the existence of the same “archetype”(structural plan) in all vertebrates but, not be-ing an evolutionist, did not consider the no-

    Homologs: genessharing a commonorigin

    Orthologs: genesoriginating from asingle ancestral genein the last commonancestor of thecompared genomes

    Paralogs: genesrelated viaduplication

    tion of common origin of homologous or-gans (74). Homology had been immediatelyreinterpreted after the publication of Darwin’sOrigin of Species (8). Darwin himself never usedthe term homology, but less than a year afterthe publication of the Origin, Huxley, in hisreview of Darwin’s work, invoked homologyas evidence of evolution (37).

    Leaping forward a century, the distinc-tion between orthologs and paralogs and theterms themselves were introduced by WalterFitch in 1970 in a now classic paper (23).However, in the early 1960s, these conceptswere considered in a sufficiently clear form,albeit with the use of different and somewhatawkward wording, in the prescient work ofZuckerkandl & Pauling, which laid the foun-dations of molecular evolution as a discipline(104, 105). A parallel line of relevant devel-opments involved theoretical and empiricalstudies of gene duplications and their rolein evolution. Although the idea of duplica-tion and its contribution to evolutionary in-novation was already present in Fisher’s clas-sic work of 1928 (22), the coherent conceptwas developed in Ohno’s famous 1970 bookEvolution by Gene Duplication (69). Ohno ar-gued that gene duplication, i.e., formation ofparalogous genes, is the main process respon-sible for the emergence of functional noveltyduring evolution whereby one of the new-born paralogs escapes the pressure of pre-existing constraints (purifying selection) andbecomes free to evolve a new function. Ina subsequent section, I briefly discuss themodern theoretical developments that pro-vide a better-supported, more nuanced per-spective on the role of gene duplication inevolution. These advances notwithstanding,Ohno’s principal message has certainly with-stood the test of time and got a second windthrough the discovery of omnipresent dupli-cations in genomes.

    For 20 years after Fitch developed thenotions of orthologs and paralogs, these def-initions quietly stayed within the domain ofevolutionary biology. A search of the PubMeddatabase (National Center for Biotechnology

    www.annualreviews.org • Orthologs and Paralogs 311

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Information, NIH, Bethesda) shows that be-tween 1971 and 1990, the term ortholog(ue)was used 35 times and the term paralog(ue)only 5 times. This is the low bound of theusage of these terms because many old issuesof biological journals, including SystematicZoology which published Fitch’s article, arenot in PubMed. Nevertheless, these numbersclearly show that orthologs and paralogswere hardly in vogue in the pregenomic era.Indeed, according to the Science Citation In-dex (http://isiwok.cit.nih.gov/portal.cgi),Fitch’s article was cited only 48 times in thefirst 20 years of its postpublication life.

    It all changed almost overnight in 1995,when the first two complete genome se-quences of cellular life forms, the bacteriaHaemophilus influenzae and Mycoplasma geni-talium, were released (26, 28). Figure 1 showsthe striking increase in the usage of the terms“ortholog” and “paralog” during the past 14years, illustrating the conspicuous change offortune that genomics brought about for theseconcepts. Suffice it to say that, in the currentversion of the PubMed database (which in-cludes publications in all areas of biology andmedicine as well as chemistry and physics),more than 1 in every 1000 papers includes“ortholog(ue)” in its title or abstract.

    Figure 1The time dynamics of the usage of the terms “ortholog” and “paralog”.The PubMed database was searched using the Entrez search engine withthe following queries: “ortholog or orthologs or orthologue ororthologues” and “paralog or paralogs or paralogue or paralogues” toaccommodate both the American and the British spelling of the terms.

    The advent of complete genomes necessi-tated a new language in which to discuss therelationships between genomes meaningfully,i.e., the parlance of evolutionary genomics.I would posit that orthology is the keystonedefinition of evolutionary genomics and par-alogy is the paramount, complementary no-tion (the graphs in Figure 1 suggest the pri-macy of orthologous relationships by showingthat the term ortholog is used several timesmore often than paralog). Indeed, in orderto make any conclusions regarding evolu-tion of genomes, one first must establish, asprecisely as possible, the correspondence be-tween genes in the compared genomes, i.e.,orthologous relationships that are inextrica-bly intertwined with paralogous relationships.These constitute the framework on which anyevolutionary anomalies and unique events canbe mapped. In the following section, I discussthe computational approaches developed todisentangle orthologous and paralogous re-lationships. Here, I present a general, theo-retical breakdown of evolutionary situationsin which orthology and paralogy reveal theirvarious faces.

    Orthology and Paralogy: Definitionsand Complications

    Orthologs are genes derived from a single an-cestral gene in the last common ancestor of thecompared species. This short, simple defini-tion includes two distinct, explicit statementsthat are important to rationalize; furthermore,it does not include other requirements thatmight seem natural but are not actually in-trinsic to orthology. First, the requirement ofa single ancestral gene is central to the con-cept of orthology. Once the ancestral genomeis shown to have contained two paralogousgenes that gave rise to the genes in question,it will be incorrect to consider the latter or-thologs, even if, on some occasions, there maybe the appearance of orthology (see below).Second, the definition specifies the presenceof an ancestral gene in the last common an-cestor of the compared species rather than

    312 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    in some arbitrary, more ancient ancestor. Ofcourse, this definition assumes the existence ofa distinct common ancestor of the comparedspecies, a proposition sometimes challengedfor prokaryotes owing to the high incidenceof HGT (see discussion below). This is re-strictive and might exclude cases where genesbehave like orthologs in the evolutionary andfunctional senses. Nevertheless, we shall stickto the above definition of orthology for thepresent discussion. An important statementthat might at first seem natural is not includedin the definition of orthology: There is no re-quirement that orthology is a one-to-one re-lationship. The ensuing discussion shows thatsuch a restriction would have been artificialand meaningless. Of even greater import isthe connection between orthology and bio-logical function. It should be emphasized thatthe above definition has nothing to do withfunction. However, a crucial property of or-thologs, which is both theoretically plausibleand empirically supported, is that they typi-cally perform equivalent functions in the re-spective organisms (I avoid the phrase “identi-cal functions” because, in different biologicalcontexts, functions cannot be literally thesame). In a subsequent section, I discuss insome detail both the available evidence ofthe functional equivalency of orthologs andsome notable exceptions. Emphasized here isthe asymmetry of the relationship betweenorthology and function: Orthologs most of-ten have equivalent functions, but the reversestatement is much weaker. Situations whenequivalent functions are performed by non-orthologous (often, non-homologous) pro-teins are common enough as captured inthe notion of non-orthologous gene displace-ment, i.e., recruitment of non-orthologousgenes for the equivalent, essential functionsin different organisms (47, 52). Thereforeit is unadvisable (to put it mildly) to speakof “functional orthologs” whereby functionalequivalency is taken as a proxy for orthol-ogy; of course, phrases such as “orthologousgenes with the same function” are quite legit-imate (disregarding, as a subtlety, the above

    distinction between equivalent and identicalfunctions).

    Paralogs are genes related via duplication.Note the generality of this definition, whichdoes not include a requirement that paralogsreside in the same genome or any indicationas to the age of the duplication leading to theemergence of paralogs (some of these dupli-cations occurred at the earliest stages of life’sevolution but the respective genes neverthe-less qualify as paralogs). As in the case of or-thology, the definition of paralogy does not re-fer to biological function, but there are majorfunctional connotations. Generally, paralogsperform biologically distinct, even if mecha-nistically related, functions. Functional differ-entiation of paralogs is a complex subject thathas been addressed in numerous theoreticaland empirical studies; a brief synopsis is givenin a subsequent section.

    Figure 2 shows a hypothetical phyloge-netic tree of a gene family that consists of

    Figure 2A hypothetical phylogenetic tree illustrating orthologous and paralogousrelationships between three ancestral genes and their descendants in threespecies. LCA, last common ancestor (of the compared species).

    www.annualreviews.org • Orthologs and Paralogs 313

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Co-orthologs: twoor more genes in onelineage that are,collectively,orthologous to oneor more genes inanother lineage dueto a lineage-specificduplication(s)

    Outparalogs:paralogous genesresulting from aduplication(s)preceding a givenspeciation event

    Inparalogs:paralogous genesresulting from alineage-specificduplication(s)subsequent to a givenspeciation event

    three branches, each illustrating a distinctcase of orthologous-paralogous relationships.Note first of all, that under the evolutionaryscenario illustrated by the tree in Figure 2,the common ancestor of the entire family ex-isted prior to the last common ancestor of allthree compared species. The latter already en-coded three paralogous genes from the givenfamily, which became the progenitors of thethree branches of the tree. Thus, each gene inbranch 1 is a paralog of each gene in branches2 and 3, and vice versa. Branch 1 correspondsto a straightforward case whereby evolutionfrom the last common ancestor involved no-thing but vertical inheritance. Accordingly,the genes in different species are orthologousto each other and, moreover, show a one-to-one orthologous relationship. However, thisis only a specific and not the most commonform of orthology (at least when large sets ofspecies are analyzed). Branch 2 shows, in ad-dition to the pattern of vertical inheritance, alineage-specific duplication in species A. This

    Figure 3A hypothetical phylogenetic tree illustrating emergence ofpseudoorthologs via lineage-specific gene loss.

    simple situation, nevertheless, requires the in-troduction of a new group of terms. Sincethe duplication occurred in a single lineageafter the radiation of the analyzed species,the paralogs in species A fit the definitionof orthology with respect to all other genesin this branch. Accordingly, genes YA1 andYA2 are coorthologs (85) of the genes YBand YC. In branch 3, the situation is furthercomplicated by lineage-specific duplicationsin each of the species. Thus, genes ZA1-3are, collectively, co-orthologs of genes ZB1-2, etc. The scheme in Figure 2 also pointsto different classes of paralogs with respectto speciation events. Relative to each speci-ation event, it makes sense to define outpar-alogs (alloparalogs), which evolved via ancientduplication(s) preceding the given speciationevent (genes X, Y, and Z in Figure 1), and in-paralogs (symparalogs), which evolved morerecently, subsequent to the speciation event(e.g., genes YA1 and YA2 relative to the radi-ation of species A and B). Even more complexevolutionary scenarios emerge when duplica-tions are associated with internal branches of aphylogenetic tree (rather than with the termi-nal branches corresponding to species) suchthat certain gene sets are inparalogs relativeto one speciation event and outparalogs rela-tive to another.

    The terms coorthologs, inparalogs, andoutparalogs are relatively new (85) and so farhave not been widely adopted. Nevertheless,they seem to be helpful for concise and ac-curate description of the evolutionary processand functional diversification of genes.

    Let us now examine the effect of line-age-specific gene loss on the orthologousand paralogous relationships between genes;Figure 3 shows a hypothetical example ofdifferential gene loss in two lineages obscur-ing these relationships. This hypothetical sce-nario starts with two genes (X and Y) thatare outparalogs relative to the included spe-ciation event. Subsequently, gene Y is lost inspecies A, whereas gene X is lost in speciesD; the species B and C retain both paralogs(Figure 3). By comparing species A and D in

    314 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    isolation, we might conclude that gene XA isthe ortholog of gene YD. Under the scenarioin Figure 3, this conclusion is obviously false:These genes are paralogs because they evolvedfrom two paralogous genes of the last com-mon ancestor of the compared species (out-paralogs). By analyzing the entire set of repre-sentatives of the given gene family in the fourspecies and applying the parsimony princi-ple, we can infer the correct evolutionary sce-nario and, accordingly, draw the conclusionthat the genes XA and YD are, actually, out-paralogs relative to the divergence of speciesA and D; these genes only mimic orthologyand, for convenience, we may call them pseu-doorthologs. Such conclusions are interest-ing not only from the evolutionary stand-point but may also have substantial functionalimplications.

    Now we must consider the effects of HGTon the observed orthologous and paralogousrelationships between genes. As shown inFigure 4a, two species (A and B) may havehomologous genes of which one is ancestralfor the given lineage but the other has beenacquired via HGT from an outside sourceC, displacing the ancestral gene. This phe-nomenon has been dubbed xenologous genedisplacement [XGD; (51)]. In an even morecomplex case, both genes might have been ac-

    Xenologous genedisplacement:displacement of agene in a givenlineage with amember of the sameorthologous clusterfrom a distantlineage (xenolog)

    Pseudoparalogs:homologous genesthat come out asparalogs in asingle-genomeanalysis but actuallyended up in the givengenome as a result ofa combination ofvertical inheritanceand HGT

    quired from different sources. In a perfunc-tory analysis, such a pair of genes (XA and XBin Figure 4a) would mimic orthologs. Ob-viously, however, these genes do not fit theabove definition of orthology because theydo not come from a single ancestral gene inthe last common ancestor of the comparedspecies. To designate such pseudoorthologsacquired from different sources, the rarelyused but natural and useful term xenologs hasbeen proposed (32, 33, 76). As with pseu-doorthology caused by lineage-specific geneloss, distinguishing xenology from true or-thology may be hard, if possible at all, in a pair-wise genomic comparison. However, whenmultiple genomes are analyzed such that wecan, if only roughly, identify the origin ofeach gene, xenologs and true orthologs be-come distinguishable.

    A related but distinct situation transpireswhen a species (B) acquires via HGT agene homologous to a gene that it alreadyhas, without the latter gene being displaced(Figure 4b). The result of such an event couldreasonably be described as pseudoparalogy(such that XB1 and XB2 are pseudoparalogs)because the homologous genes in species Bare not paralogs under the simple defini-tion given above: They have not evolved viagene duplication at any stage of evolution.

    Figure 4Effect of horizontalgene transfer onorthology andparalogy. (a) Ahypotheticalevolutionaryscenario with HGTleading to xenology.(b) A hypotheticalevolutionary scenariowith HGT leadingto pseudoparalogy.LUCA, LastUniversal CommonAncestor (of allextant life forms).

    www.annualreviews.org • Orthologs and Paralogs 315

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    As discussed below, there are situations, par-ticularly those that involve endosymbiosis,where pseudoparalogy caused by HGT iscommon.

    The concept of orthology is further com-plicated by another phenomenon that is com-mon in genome evolution, gene fusion, as wellas the complementary process of gene fission(54, 83, 100). The impact of such evolution-ary events on orthology is that different parts(often encoding distinct domains) of genesin one species are orthologous to differentgenes in another species. Thus, a new typeof orthologous relationship emerges wherebya gene ceases to be the atomic unit of orthol-ogy. Further implications of this shift in mean-ing are considered in the final section of thisreview.

    Table 1 summarizes the meaning and areaof applicability of each term introduced inthis section. This system of definitions andterms is more complex than the simple di-chotomy of orthology and paralogy, withsubcategories of paralogs (in- and outpar-

    alogs) as well as more specialized notions,xenologs, pseudoorthologs, and pseudopar-alogs, additionally introduced. The advan-tage, I believe, is that this system seems tobe complete, i.e., includes definitions that ap-ply to every logically imaginable case of ho-mologous relationships between genes. Wenow turn to the empirical manifestations andimplications of this system in evolutionarygenomics.

    IDENTIFICATION OFORTHOLOGS AND PARALOGS:PRINCIPLES AND TECHNIQUES

    As soon as genome comparison became apractically important task, the questions aroseas to how should one delineate the set of cor-responding genes (or, inaccurately, but intu-itively, “the same” genes in different species),i.e., orthologs. Indeed, evolutionary classifi-cation of genes, which can only be basedon the concepts of orthology and paralogy,becomes a must as soon as several genome

    Table 1 Homology: terms and definitions

    Homologs Genes sharing a common originOrthologs Genes originating from a single ancestral gene in the last common ancestor

    of the compared genomes.Pseudoorthologs Genes that actually are paralogs but appear to be orthologous due to

    differential, lineage-specific gene loss.Xenologs Homologous genes acquired via XGD by one or both of the compared

    species but appearing to be orthologous in pairwise genome comparisons.Co-orthologs Two or more genes in one lineage that are, collectively, orthologous to one

    or more genes in another lineage due to a lineage-specific duplication(s).Members of a co-orthologous gene set are inparalogs relative to therespective speciation event.

    Paralogs Genes related by duplicationInparalogs (symparalogs) Paralogous genes resulting from a lineage-specific duplication(s)

    subsequent to a given speciation event (defined only relative to aspeciation event, no absolute meaning).

    Outparalogs(alloparalogs)

    Paralogous genes resulting from a duplication(s) preceding a givenspeciation event (defined only relative to a speciation event, no absolutemeaning).

    Pseudoparalogs Homologous genes that come out as paralogs in a single-genome analysisbut actually ended up in the given genome as a result of a combination ofvertical inheritance and HGT.

    316 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    sequences become available.1 Individual char-acterization of each gene in each genomerapidly turns into a gargantuan task for com-putational analysis and completely impracti-cal experimentally as more genomes are se-quenced (12, 30). Comparative genomics canbe feasible and meaningful only if the num-ber of distinct entities to be analyzed is sub-stantially reduced by introducing a rationalclassification of genes. A natural way to do sois to delineate sets of orthologous (includingco-orthologous) genes. The extent to whichthis is going to help in genome analysis criti-cally depends on the nature of the relation-ships between genomes of different speciesthat only can be deciphered empirically. Totake one extreme, if all genes in comparedgenomes formed perfect clusters of one-to-one orthologs, the number of entities to studywould be equal to the number of genes in eachgenome and would remain constant regardlessof how many new genomes were sequenced.Should that be the case, the entire enterpriseof comparative-evolutionary genomics wouldbe straightforward to the point of being trivial.On the other end of the spectrum, should thenumber of identifiable clusters of orthologsbe small compared with the total numberof genes in genomes, comparative genomicswould be in serious trouble.

    Since orthology and paralogy are defini-tions that are inextricably coupled to cer-tain types of evolutionary events (speciationand duplication, respectively), the classicalscheme for identifying orthologs involves

    1Note that the first efforts to delineate sets of orthologousgenes shared by relatively large genomes were undertakenyears before the appearance of complete genome sequencesof cellular life forms. Indeed, this was done as soon as thefirst pair of related genomes containing many (on the or-der of 100–200) genes became available, namely, when thegenome of varicella-zoster virus was sequenced in 1986 andcompared with the previously sequenced genome of an-other herpesvirus, Epstein-Barr virus (10). Subsequently, acore set of conserved (orthologous) genes was delineatedfor several herpesviruses (35). However, in these studies,the conceptual basis of comparative genomics was not con-sidered explicitly and neither were the terms orthologs andparalogs used.

    phylogenetic analysis and, in particular, theprocedure generally known as tree reconcili-ation (20, 63, 73, 102). Under this approach,the topology of a gene tree is compared withthat of the chosen species tree and the twoare reconciled on the basis of the parsimonyprinciple, by postulating the minimal possiblenumber of duplication and gene-loss eventsin the evolution of the given gene. The rec-onciled tree is expected to reflect orthologousrelationships. However, genome-wide appli-cation of this approach is effectively precludedby both fundamental and practical difficulties.The principal obstacle faced by tree reconcil-iation (and any other phylogenetic approach)as a strategy for ortholog identification is theprevalence of HGT, especially in prokaryotes.Strictly speaking, the wide spread of HGT in-validates the very notion of a species tree, al-lowing, at best, the use of various forms ofconsensus trees for multiple genes as surro-gates of the species tree (15, 16, 98). Further-more, even when a particular tree topologyis taken as the species tree, the possibility ofHGT of the analyzed gene undermines theconcept of reconciliation because the topolo-gies of the two trees are likely to be gen-uinely different. At a more practical level, evenwhen HGT is not considered to be a ma-jor factor, as in the evolution of eukaryotes,both the species tree and many gene trees arefraught by uncertainties and artifacts. Evenmore practically, fully automated construc-tion and analysis (with appropriate reliabil-ity tests) of trees for all genes in sequencedgenomes is a major challenge for software en-gineering and is expensive computationally.

    Further in this section, I discuss several at-tempts at explicit phylogenetic classificationof orthologs and paralogs. However, giventhe substantial difficulties faced by these ap-proaches, most genome-wide studies to dateemploy simplifications and shortcuts. Thesimple but crucial assumption that underliessuch “surrogate” approaches is that the se-quences of orthologous genes (proteins) aremore similar to each other than they are toany other genes (proteins) from the compared

    www.annualreviews.org • Orthologs and Paralogs 317

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Pseudoorthologs:genes that actuallyare paralogs butappear to beorthologous due todifferential,lineage-specific geneloss

    genomes, i.e., they form symmetrical besthits (SymBets). Conversely, it is assumedthat SymBets are most likely to be formedby orthologs, suggesting a very simple andstraightforward method for identification oforthologs. For brevity, I call these two as-sumptions taken together the SymBet hy-pothesis. It seems highly plausible, basedon the definition of orthology and the no-tion that orthologs typically occupy the samefunctional niche, that the SymBet hypothesisis true, at least statistically. Consider the alter-native: A gene from one genome is most sim-ilar not to its ortholog but to a paralog fromanother genome. As shown in Figure 5a, thisrequires a substantial difference in the ratesof evolution of paralogs sufficient to offsetthe divergence of paralogs prior to speciationsuch that, using the notation of Figure 5a,ASx > SDx + SDy + BSy (A and B are twospecies; S stands for speciation and D for du-plication; ASx and BSy are the amounts ofdivergence accumulated, respectively, by thegenes XA and YB after speciation; SDx andSDy are the amounts of divergence accumu-lated by the paralogous genes x and y afterduplication but prior to the speciation). It isgenerally unlikely that one paralog evolves somuch faster than another, given that paralogsretain similar functions. Although asymme-try in the evolution of paralogs has been de-tected in some studies, typically, it is relativelysmall. Furthermore, even should that be thecase, only the first assumption of the SymBethypothesis will be violated. In other words,if one paralog evolves much faster than theother, this could lead to a false negative un-der the SymBet method for ortholog detec-tion (a pair of orthologs missed) but not to afalse positive (no erroneous detection of or-thologs). Lineage-specific gene duplicationsproducing inparalogs are likely to be a morecommon cause of violation of the first as-sumption of the SymBet hypothesis; these du-plications may obscure orthologous relation-ships leading to false negatives (Figure 5b). Itseems that the only realistic situations whenthe second assumption of the SymBet hypoth-

    esis can be violated are the cases of pseu-doorthology and xenology (Figures 3 and4). Pseudoorthologs and xenologs will likelyform a SymBet and produce a false positiveunder this approach to orthology detection.However, even though formally, neither pseu-doorthologs nor xenologs are orthologs, thereis a subtle difference between these two situ-ations. Unlike pseudoorthologs, which nevercan be considered orthologous given their ori-gin by duplication, xenologs come across asorthologs in comparisons of genomes fromthe donor and recipient lineages (i.e., XA isthe ortholog of XC in Figure 4). Accordingly,xenology is likely to be a more reliable pre-dictor of gene function that pseudoorthology(see discussion below).

    Given these uncertainties, empirical re-sults on the prevalence of SymBets in genomecomparisons are important to assess the levelof one-to-one orthology between genomesthat is critical for evolutionary and functionalgenomics. Table 2 shows the number of Sym-Bets between prokaryotic genomes separatedby varying evolutionary distances. These re-sults clearly demonstrate that a one-to-one or-thologous relationship is a major rather than aminor pattern in genome evolution. For rela-tively closely related species, e.g., different γ -Proteobacteria, the fraction of probable one-to-one orthologs identified as SymBets typi-cally is >0.5. Predictably, the fraction of genesthat produce SymBets drops with evolution-ary distance but remains substantial (20%–30% of the genes) even between bacteria andarchaea. This fraction also depends on the to-tal number of genes in a genome such thatsmall genomes with few paralogs (e.g., My-coplasma genitalium) show a high level of one-to-one orthology even with distant species.Detection of SymBets is arguably the simplestmethod for the identification of probable or-thologs that is most suitable for closely relatedgenomes but also serves well at greater evo-lutionary distances for the specific purpose ofdetecting one-to-one orthologs.

    The demonstration that numerous genesin sequenced genomes produced SymBets

    318 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Figure 5Orthology and genome-specific best hits. (A) An evolutionary scheme illustrating the connectionbetween orthology and symmetrical best hits (SymBets). X and Y represent two paralogous genes. Thebranch lengths in the tree are taken to reflect evolutionary distances between the compared genes, andthe formulas for the distances between orthologs and paralogs are given. A molecular clock is assumedfor the evolution of orthologs but not paralogs. A and B are two species; D indicates a duplication eventand S indicates a speciation event; ASx and BSy are the amounts of divergence accumulated, respectively,by the genes XA and YB after speciation; SDx and SDy are the amounts of divergence accumulated bythe paralogous genes x and y after duplication but prior to the speciation. (B) An evolutionary schemeillustrating violation of the SymBet relationship caused by a lineage-specific duplication. Arrowsdesignate best hits; circles of similar shades represent inparalogs. X, Y, and Z designate three cases of(co)orthologous relationships: one-to-one (X), one-to-many (Y) and many-to-many (Z).

    www.annualreviews.org • Orthologs and Paralogs 319

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Table 2 Symmetrical best hits between selected prokaryotic genomesa

    Ec Yp Hp Bs Mg Aa Tm Mj Ma TaEc-4289 0.584 0.456 0.305 0.527 0.519 0.428 0.24 0.151 0.308Yp-4083 2385 0.432 0.28 0.525 0.496 0.423 0.218 0.144 0.275Hp-1566 714 677 0.403 0.442 0.396 0.321 0.181 0.211 0.176Bs-4100 1251 1144 631 0.648 0.495 0.465 0.239 0.153 0.306Mg-480 253 252 212 311 0.469 0.515 0.235 0.281 0.238Aa-1553 806 771 615 768 225 0.449 0.279 0.33 0.256Tm–1846 808 780 503 858 247 697 0.245 0.294 0.265Mj-1770 425 385 284 423 113 434 433 0.489 0.362Ma-4540 649 589 330 627 135 513 543 866 0.415Ta-1478 455 406 260 453 114 378 392 535 614

    aThe bottom half of the table shows the number of SymBets for each pair of genomes and the upper half shows thefraction of proteins in the smaller genome that form SymBets with the putative orthologs from the larger genome.Species abbreviations are as follows. Proteobacteria: Ec, Escherichia coli; Yp, Yersinia pestis; Hp, Helicobacter pylori:Gram-positive bacteria: Bs, Bacillus subtilis; Mg, Mycoplasma genitalium: deep-branching, hyperthermophilic bacteria:Aa, Aquifex aeolicus; Tm, Thermotoga maritima; archaea: Mj, Methanocaldococcus jannaschii; Ta, Thermoplasmaacidophilum; Ma, Methanosarcina acetivorans. For each species, the total number of protein-coding genes is indicated.

    even between relatively distant species (90)made it clear that construction of a genome-wide evolutionary classification of ortholo-gous and paralogous genes was a feasible taskeven if the SymBet approach itself is insuffi-cient for this purpose owing to the prevalenceof inparalogs. To my knowledge, such a classi-fication was first implemented in the system ofthe so-called Clusters of Orthologous Groupsof proteins (COGs) (89). The phrase “orthol-ogous groups,” which has been subsequentlycriticized as “terminology muddle” (71), wasintended to emphasize that the system cap-tured not only one-to-one orthologous re-lationships but also coorthologous relation-ships between inparalogs. The idea behind theCOG approach was to generalize and extendthe notion of a genome-specific best hit. First,the requirement for the reciprocity of besthits (as in SymBets) was abandoned becauseof the in-paralog problem (see Figure 5b).Second, the notion of a genome-specific besthit was extended to multiple genomes suchthat the algorithm sought to identify clustersof consistent best hits. More specifically, theCOG construction procedure is based on theassumption that any set of at least three pro-teins from relatively distant genomes that are

    more similar to each other than they are toany other proteins from the same genomesare most likely bona fide orthologs. This pre-diction holds even if sequence similarity be-tween some of the compared proteins is rela-tively low and, accordingly, even fast-evolvinggenes can be incorporated into the COGs.Briefly, the procedure for COG constructionconsists of the following steps. (i) An all-against-all comparison of protein sequencesencoded in multiple genomes (typically us-ing the BLAST program). (ii) Detection andclustering of obvious inparalogs, i.e., proteinsfrom the same genome that are more simi-lar to each other than they are to any proteinsfrom other species. (iii) Identification of trian-gles of mutually consistent, genome-specificbest hits such that clusters of inparalogs de-tected at step 2 are treated as single entities.(iv) Merging triangles with a common side toform COGs.

    The COG approach neatly delineates clus-ters of probable orthologs that include inpar-alogs in relatively few lineages. However, theprocedure tends to err toward overlumping inthe case of large protein families that includea complex mix of in- and outparalogs. Ad-ditional complications emerge in the case of

    320 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    multidomain proteins that also may artificiallybridge unrelated COGs. Several other ap-proaches for identification of orthologs, basedon either specially designed clustering pro-cedures or on explicit phylogenetic analysis,have been developed to overcome these prob-lems and better disentangle orthologs andparalogs. In particular, the INPARANOIDprocedure developed by Sonnhammer andcoworkers identifies clusters of orthologs, in-cluding co-orthologous sets of inparalogs, forpairs of genomes, by first detecting Sym-Bets and then incorporating additional inpar-alogs according to developed statistical crite-ria (67, 82). High accuracy of identificationof inparalogs seems to be achievable with thisapproach, but the inability to handle multi-ple genomes simultaneously is a serious limi-tation. Another method for ortholog detec-tion developed by the same group involvescomparison of gene trees with species trees,with the goal of direct identification of or-thologs (86). The parts of the gene tree thathave the same topology as the species treeare inferred to include orthologs. In princi-ple, this and similar phylogenomic [i.e., ap-plying phylogenetic analysis on genome scale(19)] methods are supposed to provide thestrongest and most direct evidence of or-thology. A fundamental drawback is, how-ever, the uncertainty of the species tree inthe case of prokaryotes due to the prevalenceof HGT (this does not appear to be a prob-lem in the case of eukaryotes). In addition,the method is computationally expensive andsensitive to tree construction artifacts. Never-theless, this approach has been applied to theeukaryotic subset of the Pfam database of pro-tein families, yielding numerous (inferred) or-thologous domains (87). A very similar auto-mated phylogenomic procedure for inferenceof orthologs has been developed by Zmasek& Eddy (103). In a more recent develop-ment, a Bayesian probabilistic technique hasbeen introduced to assign probabilities to theorthology identifications (3). A major effortto identify orthologs and paralogs has beenundertaken by Perrière and coworkers who

    constructed databases of homologous genesfrom vertebrates (HOVERGEN) and bacte-ria (HOBACGEN) in which families of ho-mologs (each consisting of a mix of orthologsand paralogs) are accompanied by phyloge-netic trees (18, 79). Very recently, tools havebeen developed for tree reconciliation in theframework of the database, with the goal ofidentifying sets of orthologs (17). The effec-tiveness of these methods on genome scale re-mains to be assessed.

    In summary, although phylogenomicmethods, in principle, should be best suitedfor deciphering orthologous and paralogousrelationships, in practice, these approachesso far have not matured enough to producea comprehensive collection of orthologous-paralogous clusters covering multiple species.Such collections have been constructed onlywith methods based on sequence similarityand the notion of genome-specific best hits.Clusters produced with these approaches areby no means error-free, in particular withrespect to lumping some of the ortholo-gous gene sets into inflated, mixed clustersof orthologs and paralogs. However, exten-sive work on genome annotation as well asgenome-wide evolutionary studies performedwith the help of these systems (50) suggestthat they are sufficiently robust for extractingmeaningful evolutionary and functional pat-terns (discussed below).

    EVOLUTIONARY PATTERNS OFORTHOLOGY AND PARALOGY

    Coverage of Genomes in Clustersof Orthologs

    Probably, the aspect of orthologous clustersthat is of the most immediate importanceto evolutionary and functional genomics isthe coverage of genomes, i.e., the fractionof genes with orthologs in other species.A substantial majority of genes from eachsequenced prokaryotic genome (Figure 6a)and a somewhat lower fraction of eukary-otic genes (Figure 6b) belong to COGs

    www.annualreviews.org • Orthologs and Paralogs 321

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Figure 6Coverage of selected genomes with clusters of orthologous groups ofproteins (C/KOGs). (a) Prokaryotic genomes. (b) Eukaryotic genomes.The data are from (88). Filled volume, genes in C/KOGs; empty volume,genes not included in C/KOGs. Abbreviations: Bacteria: Aa, Aquifexaeolicus; Bs, B. subtilis; Ec, E. coli; Hi, Haemophilus influenzae;Mg, Mycoplasma genitalium. Archaea: Af, Archaeoglobus fulgidus;Ma, Methanosarcina acetivorans; Ss, Sulfolobus solfataricus. Eukaryotes:At, Arabidopsis thaliana; Ce, Caenorhabditis elegans; Dm, Drosophilamelanogaster; Sc, Saccharomyces cerevisiae. Data for Homo sapiens are notshown because the KOGs include an early, inflated version of the humangene set.

    (or the clusters of orthologous genes fromeukaryotes dubbed KOGs) (49, 88). Thecoverage of genomes with COGs slowly in-creases with the growing number of includedspecies (91). It remains unclear whether thelevel of orthology tends to 100% when thenumber of genomes tends to infinity orthere is a lower limit, and some genes aretrue, species-specific “orphans” evolving ina regime different from the majority of thegenes (29, 68). Obviously, however, the or-thology level is high and will only increasewith continued genome sequencing. There-fore, the reduction of search space providedby the classification of genes into ortholo-gous clusters is substantial and, in practicalterms, should be sufficient to cope with theflood of information produced by genomesequencing.

    One-to-One Orthologs andInparalogs

    As discussed above, lineage-specific dupli-cations leading to the emergence of inpar-alogs complicate orthologous relationshipsbetween genes. A simple analysis of the COGsallows one to evaluate the extent of this phe-nomenon, with the caveat that some lump-ing is involved, leading to an inflated estimateof the number of inparalogs. Figure 7 showsthe distribution of the number of paralogs inCOGs for four prokaryotic genomes. For allthe importance of lineage-specific expansionof paralogous families, in each genome themajority of orthologous lineages (COGs) arerepresented by a single gene. Specifically, inEscherichia coli, a complex bacterium with arelatively large genome, ∼71% of the COGsinclude a single gene, and in the case of M.genitalium, a bacterium with a near-minimalgenome, such COGs form the overwhelm-ing majority (∼92%). This conclusion is com-patible with the earlier quantitative analysisof lineage-specific expansions in prokaryotes,which detected only a few large expansionsin each genome (41), and with independentanalyses of the size distribution of paralogous

    322 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    families (38, 53). A qualitatively similar pat-tern, albeit with some predictable excess of in-paralogs, was observed for eukaryotic orthol-ogous clusters (data not shown). Moreover,1769 of the 4873 COGs (36%) contain exactlyone representative from each of the includedgenomes. It has been proposed that one-to-one orthology could be selected for in the caseof genes encoding subunits of macromolecu-lar complexes requiring strict stoichiometrydue to the deleterious effect of subunit imbal-ance (75, 94). Indeed, orthologous sets con-taining no paralogs appear to be significantlyenriched in complex subunits (49, 75, 95).

    Orthologous Clusters and theMolecular Clock

    A central tenet of Kimura’s neutral theory isthat the rate of evolution of a gene remainsthe same (with some dispersion, obviously)as long as the biological function does notchange (44). Kimura never used the term or-tholog (or paralog), but this generalizationobviously applies to orthologous gene sets,primarily those that include no inparalogs.The subsequent evolution of the molecularclock concept involved considerable dispute,with numerous studies demonstrating sub-stantial overdispersion of the clock (5, 6, 31).Genome-wide tests of the molecular clockhave been conducted only very recently. Oneapproach involved comparing the evolution-ary distances within a COG containing no in-paralogs to a standard intergenomic distance,which was defined as the median of the dis-tribution of the distances between all one-to-one orthologs in the respective genomes(66). Under the molecular clock model, thepoints on a plot of intergenic distances forthe given COG versus intergenomic dis-tances will scatter around a straight line. Astatistical method was developed to identifysignificant deviations from the clock-like be-havior. Among several hundred COGs rep-resenting three well-characterized bacteriallineages, α-Proteobacteria, γ -Proteobacteria,and the Bacillus-Clostridium group, the clock

    Figure 7Distribution of the number of paralogs in COGs for selected prokaryoticgenomes. The data were extracted from the current COG version (88).The plot is shown in the double-logarithmic scale.

    Molecular clock: acentral concept ofmolecular evolution,which posits that agene evolves at aconstant rate as longas its function doesnot change

    hypothesis could not be rejected for ∼70%,whereas the rest showed substantial devia-tions. These anomalies could be explained ei-ther by lineage-specific acceleration of evo-lution or by XGD (see below). The generalconclusion from these analyses seems to bethat the majority of orthologous genes evolvein the clock-like mode as long as there was noduplication, although the frequency of excep-tions was by no means negligible.

    The connections between gene duplica-tion and evolutionary rates have been ex-plored in considerable detail and appear tobe quite complex. Ohno’s original idea wasthat, immediately after duplication, one of thenewborn paralogs is freed from purifying se-lection and would evolve rapidly such that iteither perishes or, on relatively rare occasions,evolves a new function (neofunctionalization),after which evolution slows down again (69).Subsequent theoretical and empirical analy-ses have shown that this path of evolution,while apparently realized on some occasions,is probably not the most common outcome ofa gene duplication (61). What happens moreoften seems to be relaxation of purifying selec-tion immediately after duplication, resultingin accelerated evolution in both paralogs. Thisis thought to reflect subfunctionalization, i.e.,

    www.annualreviews.org • Orthologs and Paralogs 323

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    partitioning of the different functions of themultifunctional ancestral gene between par-alogs (45, 59, 60). Somewhat paradoxically,however, two independent recent studies haveshown that despite this acceleration, genesthat have paralogs on average evolve slowerthan those that do not (9, 42). This differencemay be due to a greater likelihood of fixationof emergent paralogs among slowly evolving(more “important”) genes.

    Xenologs, Pseudoorthologs, andPseudoparalogs

    As discussed above, xenologs are homologsthat violate the definition of orthology due toHGT. More specifically, xenology is broughtabout either by XGD or by acquisition of agene that is new for the given lineage (51).When the study of the clock-like behavior oforthologous gene sets discussed in the pre-vious section was followed up with phyloge-netic analysis of deviant cases, probable XGD

    was demonstrated for 10%–20% of the COGswithin each of the examined bacterial taxa,establishing XGD as a major evolutionaryphenomenon (66). On some occasions, XGDeven takes the form of displacement in situwhereby a gene is displaced with a horizon-tally transferred distant ortholog without dis-rupting the operon structure (70). Figure 8illustrates one such case: the displacement ofthe ruvB gene (coding for the helicase subunitof the resolvasome) in the mycoplasmas withthe ortholog from ε-Proteobacteria. In thiscase, the ruvB genes of Mycoplasma and thoseof the rest of low GC gram-positive bacteria,the taxon to which Mycoplasma belong, qual-ify as xenologs given their different phyloge-netic affinities (Figure 8c). A clear exampleof acquisition of a new gene leading to xenol-ogy is the B family DNA polymerase of γ -Proteobacteria (e.g., the polB gene of E. coli).At first sight, this gene appears to be an or-tholog of the archaeal and eukaryotic B familypolymerases (see COG0417). However, these

    −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−→Figure 8Xenologous displacement in situ of the ruvB gene in the mycoplasmas. (A) Organization of the Hollidayjunction resolvasome operon and surrounding genes in bacteria. COG0632, Holliday junctionresolvasome, DNA-binding subunit; COG2255, Holliday junction resolvasome, DNA-binding subunit;COG0817; Holliday junction resolvasome, endonuclease subunit; COG0392, predicted integralmembrane protein; COG0282, acetate kinase; COG0839, NADH:ubiquinone oxidoreductase subunit 6(chain J); COG0244, ribosomal protein L10; COG0732, restriction endonuclease S subunits; COG0809,S-adenosylmethionine:tRNA-ribosyltransferase-isomerase; COG0772, bacterial cell division membraneprotein; COG0624, acetylornithine deacetylase/succinyl-diaminopimelate desuccinylase and relateddeacylases; COG1487, predicted nucleic acid-binding protein; COG1132, ABC-type multidrugtransport system, ATPase, and permease components; COG0442, prolyl-tRNA synthetase; COG0323,DNA mismatch repair enzyme; COG1408, predicted phosphohydrolases. (B) Unrooted phylogenetictree for RuvA. (C) Unrooted phylogenetic tree for RuvB. Branches supported by bootstrap probability>70% are marked by black circles. Names of the genes from mosaic operons and the respectivebranches are shown in red. Species abbreviations: Atu, Agrobacterium tumefaciens; Bha, Bacillus halodurans;Bsu, Bacillus subtilis; Bbu, Borrelia burgdorferi; Bme, Brucella melitensis; Cje, Campylobacter jejuni Ccr,Caulobacter crescentus; Ctr, Chlamydia trachomatis; Cpn, Chlamydophila pneumoniae; Cte, Chlorobiumtepidum; Cac, Clostridium acetobutylicum; Cgl, Corynebacterium glutamicum; Eco, Escherichia coli;Fnu, Fusobacterium nucleatum; Hin, Haemophilus influenzae; Hpy, Helicobacter pylori; Lla, Lactococcus lactis;Lpl, Lactobacillus plantarum; Lin, Listeria innocua; Neu, Nitrosomonas europaea; Mlo, Mesorhizobium loti;Mge, Mycoplasma genitalium; Mpn, Mycoplasma pneumoniae; Mpu, Mycoplasma pulmonis; Mle,Mycobacterium leprae; Mtu, Mycobacterium tuberculosis; Nme, Neisseria meningitidis; Nsp, Nostoc sp.; Oih,Oceanobacillus iheyensis; Pae, Pseudomonas aeruginosa; Rso, Ralstonia solanacearum; Rpr, Rickettsia prowazekii;Rco, Rickettsia conorii; Sme, Sinorhizobium meliloti; Sau, Staphylococcus aureus; Spy, Streptococcus pyogenes;Ssp, Synechocystis PCC6803; Tma, Thermotoga maritima; Tte, Thermus thermophilus; Tpa, Treponemapallidum; Vch, Vibrio cholerae; Xfa, Xylella fastidiosa; Uur, Ureaplasma urealyticum. The figure is from (70)where the details of phylogenetic analysis are described.

    324 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    www.annualreviews.org • Orthologs and Paralogs 325

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    genes should be classified as xenologs becausethe proteobacterial version clearly does notderive from the last universal common ances-tor (which is also the last common ancestorof bacteria and archaea) but rather had beenacquired via HGT.

    As defined above, pseudoorthologs emergevia lineage-specific, differential loss of paral-ogous genes (Figure 2). A systematic searchfor pseudoorthologous genes requires de-tailed, genome-wide phylogenetic analysis,which to my knowledge has not yet been

    326 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    conducted. Nevertheless, some likely cases ofpseudoorthology can be gleaned by exami-nation of COGs. Consider COG0114 (fu-marase) and COG1027 (aspartate ammonia-lyase), which consist of paralogous enzymeswith a high level of sequence similarity to eachother. Both enzymes are widespread in bac-teria and, most likely, were already presentin the last common ancestor of all bacte-ria but apparently have been lost indepen-dently in many lineages. When comparingthe genomes of two cyanobacteria, Synechocys-tis sp. and Nostoc sp., the genes slr0018 ofthe former and alr3724 of the latter, pro-duce a SymBet and, accordingly, could beidentified as orthologs by default. However,inspection of the COGs clearly shows thatslr0018 is a fumarase whereas alr3724 is anaspartate ammonia-lyase. This seems to bea clear-cut case of pseudoorthology causedby lineage-specific, differential loss of par-alogs, with the ensuing functional differences(see below).

    The most obvious case of pseudoparalogyis the presence of numerous pairs of homol-ogous genes of ancestral and endosymbiotic(mitochondrial or, in plants, chloroplastic)origin in eukaryotes (34, 43, 56). These pseu-doparalogs are particularly abundant amongthe components of the translation machin-ery, such as ribosomal proteins or aminoacyl-tRNA synthetases. Many additional pseu-doparalogs appear to have emerged throughother routes of HGT. One of the most con-spicuous is the transfer of archaeal genesto bacteria, particularly hyperthermophiles.Figure 9 shows the phylogenetic tree for theperoxiredoxin AhpC (COG0450). This treeincludes two paralogous proteins from thehyperthermophilic bacterium Aquifex aeolicus,one of which clusters with archaeal and theother with bacterial homologs. The respec-tive genes are pseudoparalogs because theyapparently ended up in the A. aeolicus genomeas a result not of gene duplication at anystage of evolution but of horizontal trans-fer of one of the peroxiredoxin genes froman archaeal source. Conversely, the tree in-

    cludes three peroxiredoxins from the archaeaThermoplasma acidophilum and T. volcanium,at least two of which (the one nested withinthe archaeal subtree and the one with a clearbacterial affinity) are pseudoparalogs. No-tably, the only peroxiredoxin of another hy-perthermophilic bacterium, Thermotoga mar-itima, shows an archaeal affinity, suggestingthat in this lineage, the original bacterial genehad been lost, probably after the acquisitionof the archaeal version.

    Protein Domain Rearrangements,Gene Fusions/Fissions, andOrthology

    Orthologous protein in eukaryotes sometimesdiffer in their domain architectures. Thereseems to be a trend toward an increase in thecomplexity of domain architecture in paral-lel with the increase of organismal complex-ity, a phenomenon dubbed domain accretion(48, 49). Apparently, additional domains ac-quired by proteins from more complex or-ganisms provide additional interactions lead-ing, in particular, to increased complexityof signal transduction and various regulatoryprocesses. Differences in domain architec-tures can also be detected between orthologsfrom major prokaryotic taxa; one such case isillustrated in Figure 10a. The DnaG-like pri-mases of bacteria and archaea share a highlyconserved catalytic domain and appear to beorthologous, especially given that they arerepresented by a single protein in all bacte-rial and archaeal genomes (62). However, thebacterial and archaeal orthologs have differentaccessory domains, a Zn-finger and a distinctmodule of a helicase domain, respectively(Figure 10a), which may reflect substantialfunctional differences (see next section).

    As mentioned above, gene fusions and fis-sions, which are common in genome evolu-tion, affect the very notion of orthology: Inthis case, a single gene in some species cod-ing for a multidomain protein is orthologousto two or more distinct genes coding for therespective individual domains in another set

    www.annualreviews.org • Orthologs and Paralogs 327

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Figure 9Horizontal gene transfer leading to pseudoparalogy. The two pseudoparalogous peroxiredoxins fromAquifex aeolicus are shown in red, the three pseudoparalogs from the Thermoplasmas in blue, and theonly peroxiredoxin of Thermotoga maritima in purple. The genes are identified by full species names.The maximum likelihood, unrooted phylogenetic tree was constructed as previously described (70).

    328 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    Figure 10Rearrangements of gene structure and orthology. (a) Domain architectures of bacterial and archaealDnaG-like primases. (b) Independent fission of the DNA polymerase I gene in multiple bacteriallineages. (c) Fusion of the genes for bacterial glycyl-tRNA synthetase subunits.

    of species. Figure 10 (b and c) shows twocases of such relationships. In the example inFigure 10b, the bacteria A. aeolicus, A.pyrophilus, Desulfitobacterium hafniense, andRubrobacter xylanophilus encode the poly-merase and 5′–3′ exonuclease domains ofDNA polymerase I in two distinct genes, un-like all other bacteria in which these enzymesare domains of a single, multidomain protein.The bacteria in which the nuclease and poly-merase activities reside in different proteins

    belong to three distinct lineages, suggestingthree independent fissions of the polA gene.Whereas the genes for the nuclease and thepolymerase are adjacent in the two Aquifexspecies, in the other two bacteria they are not,implying genome arrangement subsequentto gene fission. By contrast, the example inFigure 10c shows the fusion of the genesfor the α and β subunits of glycyl-tRNAsynthetases in the parasitic bacteria ofthe Chlamydiae branch and the pathogenic

    www.annualreviews.org • Orthologs and Paralogs 329

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    actinobacterium Tropheryma whipplei. In thiscase, it appears likely that gene fusion oc-curred only once, with subsequent horizontaldissemination of the fused gene.

    FUNCTIONAL CORRELATES OFORTHOLOGY AND PARALOGY

    The validity of the conjecture on functionalequivalency of orthologs is crucial for reliableannotation of newly sequenced genomes and,more generally, for the progress of functionalgenomics. The huge majority of genes in thesequenced genomes will never be studied ex-perimentally, so for most genomes transfer offunctional information between orthologs isthe only means of detailed functional char-acterization. To what extent is such trans-fer legitimate? A rough estimate can be ob-tained by comparing the available functionalinformation for experimentally characterizedone-to-one orthologs from model organisms.Inspection of the 1330 COGs that containone-to-one orthologs from the well-studiedbacteria E. coli and B. subtilis failed to re-veal a single clear-cut case of different func-tions, although subtle differences, e.g., in en-zyme or transporter specificities are common(E.V.K., unpublished observations). Thus, ingeneral, the notion that one-to-one orthologsare functionally equivalent seems to hold well.

    However, at greater evolutionary dis-tances, particularly across the primary king-dom divides, there are prominent cases of ap-parent major differences in the functions oforthologs. Thus, the bacterial and archaealDnaG-like primases, which figure in the pre-vious section in connection with a differencein domain architecture, seem to function infundamentally different processes. BacterialDnaG is an essential component of the repli-cation machinery, namely the polymerase re-sponsible for the synthesis of RNA primersused to initiate replication (4). Although thefunction of the archaeal ortholog has not beenstudied in detail, there is no evidence of its in-volvement in replication; furthermore, it hasbeen shown to associate with the exosome, the

    RNA degradation complex, suggesting a rolein RNA processing (21). A converse situationseems to exist with the archaeo-eukaryotic-type primase which is an essential replicationcomponent in archaea and eukaryotes, but isinvolved in a distinct repair pathway in thosebacteria that have this gene (11). In this case,the bacterial versions actually are likely to bexenologs of the archaeal and eukaryotic ones(2), and they apparently went through a periodof rapid evolution associated with the func-tional change.

    Acceleration of evolution accompanying aradical functional switch seems to have beena major aspect of the emergence of the eu-karyotic cell. Thus, to the best of our under-standing, eukaryotic tubulins are co-orthologsof the prokaryotic protein FtsZ, which is thekey component of the prokaryotic cell divi-sion machinery mediating septum formation(1, 58). The well-characterized functions oftubulins are completely different: They arethe principal constituents of the eukaryoticmicrotubules, cytoskeletal structures that areabsent in prokaryotes [the recent discovery oftubulins in Prosthecobacteria (39) is remark-able but may be explained by HGT from eu-karyotes to a specific bacterial lineage]. Thedrastic change of function in eukaryotes ap-parently had been accompanied by a burst ofsequence evolution such that an unequivocaldemonstration of the homology of FtsZ andtubulin became possible only through com-parison of the respective protein structures(65). An analogous situation is seen with theother signature proteins of eukaryotes, actins,and ubiquitins, whose apparent prokaryoticorthologs have completely different functionsand dramatically differ in sequence (1, 92, 96).Although in each of these cases, the relation-ship between prokaryotic and eukaryotic pro-teins is not one-to-one orthology, with manyinparalogs present in eukaryotes, the func-tional differences among these paralogs areminor compared with the profound dividebetween prokaryotes and eukaryotes. Thegeneral message from this brief survey ofthe functional equivalency of orthologs and

    330 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    of the functional switches within orthologouslineages is clear: The fundamental functionsof orthologs do change but such changes arefar from being common, tend to be associatedwith major evolutionary transitions, and areaccompanied by a substantial acceleration ofevolution.

    The functional connotations of paralogyare distinct from and, in a sense, opposite tothose of orthology. Although in some cases,paralogs may retain the same, ancestral func-tion, being fixed due to the gene dosageeffects (amplification of rRNA genes is anobvious example), the general themes asso-ciated with paralogy are functional diversifi-cation and specialization. The subfunctional-ization mode of evolution of paralogs that hasbeen explored theoretically in great detail byLynch and coworkers (27, 60, 61) seems bestcompatible with the demonstration that selec-tive constraints affect paralogs even immedi-ately after duplication (45). Examples of sub-functionalization are plentiful. A classic oneis the distribution of the universal transcrip-tional function of the archaeal RNA poly-merase between the three RNA polymerasesof eukaryotes, with RNA polymerase I be-coming responsible for the transcription ofrRNA genes, RNA polymerase II transcrib-ing protein-coding genes, and RNA poly-merase III transcribing tRNA genes and thosefor other noncoding RNAs (101). The orig-inal version of Ohno’s neofunctionalizationmodel, whereby one of the emerging par-alogs initially evolves free of constraints (likea pseudogene) but then accidentally hits ona new function, might be unrealistic or, atleast, rare. More generally, however, evolu-tion of paralogs, particularly in the contextof lineage-specific expansions of paralogousfamilies, may involve both subfunctional-ization and neofunctionalization. Indeed, itseems inevitable that, among the enormousrepertoire of signal-transduction systems thatevolved via multiple duplications, such as pro-tein kinases, receptors, and ubiquitin systemscomponents (to mention just a few of the mostconspicuous cases), there should be specifici-

    ties that were not present in the ancestralgene. Very recently, He & Zhang exploredthe possibility of combination of subfunction-alization and neofunctionalization by exam-ining protein-protein interactions of paralo-gous gene products in yeast (36). Their resultssuggested a more complex subneofunctional-ization model under which the evolution ofparalogs starts with rapid subfunctionalizationbut subsequently often switches to the neo-functionalization mode.

    GENERAL DISCUSSION

    Orthology and Paralogy asEvolutionary Inferences andthe Homology Debates

    The preceding discussion aimed to show howthe notions of orthology and paralogy perme-ate modern genomics and provide the cruciallink between genomics and evolutionary biol-ogy. To a large extent, these concepts form thefoundation of evolutionary genomics and arealso of major importance for functional ge-nomics or, in more practical terms, for func-tional annotation of sequenced genomes. It isuseful, however, to explicitly define the episte-mological status of these concepts. Orthologyand paralogy as well as the generalized notionof homology imply specific statements on thecourse of evolution of the respective genes. Inother words, these statements are inferencesfrom one or another form of phylogeneticanalysis rather than observables. The prin-cipal observables in comparative genomicsare sequence similarity between genes andthe proteins they encode and, increasingly,structural similarity between proteins. Theseobservations are employed, directly or indi-rectly (e.g., through phylogenetic analysis),to infer orthology and paralogy (or generichomology). Failure to distinguish betweenobservables and inferences resulted in apersistent terminological morass. Stronghomology, percentage homology, and simi-lar oxymorons have survived in the biolog-ical literature for decades despite strongly

    www.annualreviews.org • Orthologs and Paralogs 331

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    worded refutations, and even guidelines regu-lating the usage of “homology” that have beenadopted by some journals (81, 97). Because ofthese abuses but, more importantly, becausebiologists often consider evolutionary infer-ences to be inherently unreliable, suggestionshave been made to dispense with the infer-ential terms (and, presumably, the underlyingconcepts) altogether. In a recent provocativearticle, Varshavsky proposed using the new,inference-free terms “sequelog” and “spalog”to designate, respectively, proteins that showsequence or structural similarity to each other(93). In an earlier eloquent comment, Petskoargued against using the terms ortholog andparalog on the simpler grounds that theseterms unnecessarily complicate the narrativein research articles without clarifying any-thing (80). The desire for simplicity and useof neutral terms is understandable and wouldhave been justified if orthology and paral-ogy (and all the derivatives thereof ) were justwords. However, as I argued here and else-where (46), this does not seem to be thecase. Instead, orthology and paralogy appearto be concepts that carry substantive mean-ing and, even apart from the (perhaps, debat-able to some) intrinsic interest of evolutionaryrelationships, have major functional conno-tations. Although the terms orthologs andparalogs may complicate the language of ge-nomics, in my opinion, the clarification theybring to our understanding of the evolu-tionary and functional relationships amonggenes and genomes by far outweighs anyinconvenience.

    Generalized Concepts of Orthologyand Paralogy

    Throughout most of this review and othertreatises, the concepts of orthology andparalogy are applied to genes as units (i.e.,whenever we speak of orthologs, we meanorthologous genes). However, I also dis-cussed complications to this approach stem-ming from gene fusion/fission and, less triv-ially, from lineage-specific changes in the

    domain architecture of orthologous proteinsoccurring during evolution. The latter situa-tion challenges the gene-centric definition oforthology inasmuch as certain parts of genesappear to be orthologous whereas others arenot. In principle, it seems possible to extendthe notion of orthology to individual domainsand, ultimately, to any stretch of nucleotidesequence down to a single base (97). The fun-damental definition always remains the same:Genomic elements in the compared speciesthat descend from the same ancestral ele-ment in the genome of their last commonancestor should be considered orthologous.From a maximalist standpoint, one could ar-gue that evolutionary relationships within aset of genomes should be considered resolvedonly after the status of each base pair in eachgenome (both in coding and noncoding re-gions) is established with respect to orthologyand paralogy. This could be an achievable goalfor closely related genomes (e.g., human andchimpanzee) but seems to be unrealistic fordistant species.

    On the opposite, genome-wide scale, thenotions of orthology and paralogy naturallyapply not only to genes but to strings ofgenes that retain the ancestral order (con-served synteny blocks). In relatively closely re-lated genomes (e.g., primates and rodents ordifferent enterobacterial species), a conservedsynteny block may include hundreds or eventhousands of genes, whereas in distantly re-lated genomes, there is very little conservationof gene order (7, 99). Thus, orthology andparalogy are manifest throughout all levels ofgenome comparison. Nevertheless, the gene-centric perspective adopted in the precedingsections appears to be most relevant for dis-secting the results of comparison of multiplegenomes separated by a wide range of evolu-tionary distances.

    A different aspect of generalization of theconcepts of orthology and paralogy pertainsto the complex structure of orthologous geneclusters caused by the spread of duplicationevents over the phylogenetic tree. A singleorthologous cluster defined at the deepest

    332 Koonin

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    branching point of a tree is often resolved intoseveral clusters within subtrees. The cases oftubulins and actins briefly discussed above ina different context clearly illustrate this point.All eukaryotic tubulins are co-orthologousto the single prokaryotic FtsZ proteins, justas actins are orthologous to the prokaryoticMreB. Within eukaryotes, however, severalorthologous sets can be readily identified foreach of these proteins.

    CONCLUSIONS

    Orthology and paralogy are not only keyterms but are an integral part of the concep-

    tual foundation of evolutionary and functionalgenomics. Only consistent usage of these andsome derivative definitions, such as in- andoutparalogs, provides for construction of a ro-bust evolutionary classification of genes andreliable functional annotation of newly se-quenced genomes. Further improvement ofclustering and phylogenetic methods for iden-tification of orthologs and paralogs is requiredfor the progress of genomics as the number ofsequenced genomes rapidly increases. Orthol-ogy and paralogy appear to be rich and flexibleconcepts that allow further development andare well suited to describe the complexity ofgenome evolution.

    SUMMARY POINTS

    1. Orthologs and paralogs are two types of homologous genes that evolved, respectively,by vertical descent from a single ancestral gene and by duplication.

    2. Distinguishing between orthologs and paralogs is crucial for successful functionalannotation of genomes and for reconstruction of genome evolution.

    3. A finer classification of orthologs and paralogs has been developed to reflect theinterplay between duplication and speciation events, and effects of gene loss andhorizontal gene transfer on the observed homologous relationship.

    4. Methods for identification of sets of orthologous and paralogous genes involve phy-logenetic analysis and various procedures for sequence similarity–based clustering.

    5. Analysis of clusters of orthologous and paralogous genes is instrumental in genomeannotation and in delineation of trends in genome evolution.

    6. Rearrangements of gene structure confound orthologous and paralogous relation-ships.

    7. The gene-centered concepts of orthology and paralogy can be generalized down-ward, to the level of strings of nucleotides and even single base pairs, and upward, tomultigene arrays.

    ACKNOWLEDGMENTS

    I thank Yuri Wolf, Marina Omelchenko, and Kira Makarova for invaluable help with dataanalysis; Yuri Wolf for critical reading of the manuscript; and Walter Fitch, Roy Jensen, PavelPevzner, Erik Sonnhammer, Alexander Varshavsky, and Emile Zuckerkandl for instructivediscussions. Due to space constraints, it was impossible to cite all relevant publications in thisreview; my sincere apologies and appreciation to all colleagues whose important work is notcited.

    www.annualreviews.org • Orthologs and Paralogs 333

    Ann

    u. R

    ev. G

    enet

    . 200

    5.39

    :309

    -338

    . Dow

    nloa

    ded

    from

    arj

    ourn

    als.

    annu

    alre

    view

    s.or

    gby

    Sta

    nfor

    d U

    nive

    rsity

    Rob

    ert C

    row

    n L

    aw L

    ib. o

    n 05

    /01/

    07. F

    or p

    erso

    nal u

    se o

    nly.

  • ANRV260-GE39-15 ARI 12 October 2005 11:22

    LITERATURE CITED

    1. Amos LA, van den Ent F, Lowe J. 2004. Structural/functional homology between thebacterial and eukaryotic cytoskeletons. Curr. Opin. Cell Biol. 16:24–31

    2. Aravind L, Koonin EV. 2001. Prokaryotic homologs of the eukaryotic DNA-end-bindingprotein Ku, novel domains in the Ku protein and prediction of a prokaryotic double-strandbreak repair system. Genome Res. 11:1365–74

    3. Arvestad L, Berglund AC, Lagergren J, Sennblad B. 2003. Bayesian gene/species treereconciliation and orthology analysis using MCMC. Bioinformatics 19(Suppl.)1:i7–15

    4. Benkovic SJ, Valentine AM, Salinas F