

Special issue: Gene Ontology for microbiologists

Viewing the microbial world through the lens of the Gene Ontology

Brett M. Tyler

Virginia Bioinformatics Institute, Virginia Polytechnic and State University, Blacksburg, VA 24061, USA

Editorial

The microbial world is incredibly diverse, spanning many kingdoms of life and an extraordinary range of abiotic and biotic environments. The Gene Ontology (GO) provides standardized terms that are well suited to identifying diverse genes that perform similar roles in microbes, even when little or no primary sequence similarity is present, and for linking reductionist and systems-level experimental data. The use of GO terms to describe microbial functions began with the inclusion of Saccharomyces cerevisiae in the original GO Consortium, and recently, more than 800 terms have been added by the Plant-Associated Microbe Gene Ontology (PAMGO) Consortium. In this special issue, nine articles review a wide variety of dimensions of the microbial world that have been described using GO terms.

Corresponding author: Tyler, B.M. ([email protected]).

The microbial world is incredibly diverse, spanning many kingdoms of life and an extraordinary range of abiotic and biotic environments. Yet many of the challenges that microbes face are similar: finding adequate sources of energy, carbon, nitrogen and other key nutrients; avoiding or tolerating desiccation; tolerating extremes of heat, pH, salinity or radiation exposure; and, in the case of microbes that associate with larger host organisms, avoiding, neutralizing or tolerating host defense responses. With the rapid increase in microbial genomic sequencing [1,2], it has become possible in principle to examine the genomes of diverse microorganisms in search of genes responsible for common functions. In many cases, the genetic solutions to these challenges are ancestral and have been inherited by all organisms; for example, DNA repair [3]. In other cases, extensive lateral gene transfer has resulted in one genetic solution being shared among diverse taxa; for example, genes for nitrogen fixation and antibiotic resistance [4]. Lateral gene transfer even seems to occur across the prokaryote–eukaryote divide [5]. In these cases, identifying genes responsible for common functions is greatly facilitated by the DNA sequence similarities that reflect the common origin of the genes involved. However, there are also numerous instances in which diverse microbes have evolved new functions independently, through neofunctionalization of existing genes or copies of them [6,7] and/or by reprogramming the regulatory circuits that control their existing genes [8]. One of the principal examples of this is the evolution of microbe–host associations, which has occurred independently at least once (and probably many times) in many kingdoms of life, including archaea (e.g. symbionts of rumen ciliates [9]), eubacteria (numerous examples), animals (e.g. nematodes), fungi (numerous examples), mycetozoa (e.g. Acanthamoeba), green plants (e.g. Helicosporidium pathogens of insects and algal symbionts of lichens), Stramenopiles (e.g. oomycete pathogens and diatom symbionts of sponges), Alveolata (e.g. Plasmodium and dinoflagellate symbionts of corals), Euglenozoa (e.g. trypanosomes), Diplomonads (e.g. Giardia) and Parabasalids (e.g. Trichomonas). In these cases, which represent convergent evolution, identifying genes that account for similarities of function is much more difficult because entirely different sets of genes might have been co-opted by the organisms.

The alternative to using DNA sequence similarity to identify genes with common functions is to draw upon experimental data that have been used to describe the function of individual genes. Thus, in principle, one could search the peer-reviewed literature for common descriptions of gene functions that arise from experiments. However, a major challenge of this approach is the variation in the use of natural language to describe gene functions. For example, ‘attachment’, ‘adhesion’ and ‘pre-penetration activity’ all describe similar events. A further challenge is that different communities often construct different understandings of related concepts and/or might use identical words to describe completely different processes. For example, for researchers who study obligate pathogens such as viruses or rust fungi that are entirely dependent on their hosts for proliferation, pathogenicity is considered fundamental to every function of the organism. However, for researchers who study opportunistic pathogens, pathogenicity is considered a side issue, relevant to perhaps a few genes and a few specific environments.
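The natural-language problem described above can be made concrete with a small sketch. It assumes a hand-built synonym table that maps the three phrases quoted in the text onto a single controlled identifier. The identifier `EX:0001`, the gene names and the helper functions are all hypothetical placeholders for illustration; none of them is a real GO accession or an existing tool.

```python
from collections import defaultdict

# Hypothetical synonym table: the three phrases come from the text, but
# "EX:0001" is a made-up placeholder identifier, not a GO accession.
SYNONYMS = {
    "attachment": "EX:0001",
    "adhesion": "EX:0001",
    "pre-penetration activity": "EX:0001",
}


def normalize(description):
    """Map a free-text description to its controlled identifier, or None."""
    return SYNONYMS.get(description.strip().lower())


def group_by_term(reports):
    """Group gene names by controlled identifier, skipping unmapped text."""
    groups = defaultdict(list)
    for gene, description in reports:
        term = normalize(description)
        if term is not None:
            groups[term].append(gene)
    return dict(groups)


# Hypothetical literature reports using different wording for similar events.
reports = [
    ("geneA", "Attachment"),
    ("geneB", "adhesion"),
    ("geneC", "pre-penetration activity"),
    ("geneD", "sporulation"),  # no mapping in this toy table
]
print(group_by_term(reports))
```

Without the shared identifier, a literal text search for any one phrase would find only one of the three genes; with it, all three are retrieved together.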

Editorial Trends in Microbiology Vol.17 No.7

The GO provides a solution to this challenge and has been rapidly growing in its scope and use since its inception in 1998 [10,11]. The GO provides a standardized set of terms for describing the roles of gene products. The GO consists of three ontologies: molecular function, biological process and cellular component. The terms in these ontologies can be used to describe the biochemical activities of a gene product, the higher level processes to which it contributes and its cellular locations, respectively. The GO is not a simple hierarchical structure because a lower-level (i.e. more detailed) term can have more than one higher-level (less detailed) term as immediate parents. For example, ‘interaction with host’ (GO:0051701) has two direct parent terms, ‘symbiosis, encompassing mutualism through parasitism’ (GO:0044403) and ‘interspecies interaction between organisms’ (GO:0044419). These terms have precise definitions that have been agreed upon by the community of users of GO terms. In general, GO terms, especially the high-level ones, are intended to capture the broadest similarity of function possible across diverse species. The GO and the terms within it are also constantly updated as additional information is obtained, new concepts are introduced from previously or newly participating communities and existing concepts are modified or discarded. This dynamism ensures that the GO remains current, although it might sometimes create challenges for bioinformaticists who must update prior annotations.
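The multiple-parent structure described above is that of a directed acyclic graph rather than a tree. The following minimal sketch encodes only the three terms and the two parent links named in the text; the rest of the ontology is omitted, and the `PARENTS` table and `ancestors` function are illustrative names, not part of any GO software.

```python
from collections import deque

# Toy fragment: only the parent links stated in the text are included.
# Parents of the two higher-level terms are deliberately left out.
PARENTS = {
    "GO:0051701": ["GO:0044403", "GO:0044419"],  # interaction with host
    "GO:0044403": [],  # symbiosis, encompassing mutualism through parasitism
    "GO:0044419": [],  # interspecies interaction between organisms
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("GO:0051701")))
# → ['GO:0044403', 'GO:0044419']
```

Because a term can be reached through more than one parent, a gene annotated to ‘interaction with host’ is retrieved by a query on either higher-level term, which a strict tree could not express.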

The GO was established in 1998 by a consortium of three databases, the Saccharomyces Genome Database, FlyBase (for Drosophila) and the Mouse Genome Database [11,12]. Thus, eukaryotic microbes were represented from the start by the yeast S. cerevisiae. Since then, as summarized in this issue [13], yeast has often served as a model system for the development of computational methods that leverage GO annotations, owing to the very large amounts of data available from both high-throughput, systems biology experiments and gene-by-gene reductionist experiments. The first prokaryotic genome, that of Vibrio cholerae, was added to the GO annotation database in 2000 by The Institute for Genomic Research, or TIGR (now the J. Craig Venter Institute); this contribution included the first major effort to modify the GO to address prokaryotes. As summarized in this issue [14], presently there are 23 complete bacterial genomes represented in the GO annotation repository. Gene annotation in Escherichia coli has traditionally employed the Multifun scheme, on which the GO was, in part, based, and most E. coli annotations still utilize Multifun, as well as tools contained in the EcoCyc database. Recently, however, as described in this issue [15], a concerted effort has been made to apply GO annotations to E. coli gene products so that the vast amount of functional information derived from genetic and biochemical experiments in E. coli can be leveraged via the GO. As a result of this effort, more than 1400 GO annotations of E. coli genes have been added that are not based simply on automated computational methods.

Another recent major effort directed at the GO annotation of microbes has been led by the PAMGO Consortium. This consortium was formed in 2004 by ten researchers affiliated with the Virginia Bioinformatics Institute at Virginia Tech, Cornell University, Wells College, the University of Wisconsin at Madison, North Carolina State University and (at the time) TIGR. The organisms studied by consortium members span bacteria, fungi, oomycetes and nematodes. As summarized in this issue [14], although the focus of the PAMGO Consortium has been on microbe–plant associations, by design many of the terms developed are equally applicable to microbe–animal interactions and have enabled unifying concepts to be identified among plant-associated and animal-associated microbes. Furthermore, a major outcome of the PAMGO Consortium’s work has been a restructuring of the GO to accommodate terms to describe processes involved in interactions among organisms at all levels. The PAMGO Consortium has contributed more than 800 new terms, covering a wide range of functions, to the GO [14].


A central concept behind the terms developed by the PAMGO Consortium is that microbe–host interactions lie along a continuum from mutualism to parasitism, in which the relative benefits or costs to each organism can be fluid depending on such factors as the genotype of each partner and the biotic and abiotic environment of the interaction. Genes that contribute to parasitism in one microbial species might contribute to mutualism in a closely related sister species. PAMGO uses the original meaning of the term ‘symbiosis’ as ‘any intimate interaction between two organisms’ to describe the continuum of relationships from mutualism to parasitism and strongly discourages the use of ‘symbiosis’ as a synonym for ‘mutualism’ [14]. The terms developed by PAMGO are guided by the principle that almost every function, process or structure that contributes to or occurs during a microbe–host interaction might do so irrespective of the eventual cost–benefit balance between microbe and host. Four of the articles in this issue summarize different facets of microbe–host interactions that have been addressed using PAMGO terms, namely functions of bacterial effector proteins [16] and virulence factors [17], effector protein delivery [18], and infection by filamentous pathogens [19]. In addition, Arnaud et al. [8] describe how the terms created by the PAMGO Consortium have been employed in annotating genes from the human pathogen Candida albicans, nicely illustrating how GO terms can be used to identify unifying concepts in plant and animal pathogenesis. Annotation of C. albicans genes has benefited from the transfer of GO terms from the closely related yeast S. cerevisiae, yielding the unexpected observation that many regulatory pathways in C. albicans have been rewired relative to S. cerevisiae [8]. The PAMGO Consortium did not address viral infection because it differs fundamentally from infection by cellular symbionts. However, in this issue, McCarthy et al. [20] show how GO terms and the data repository AgBase have aided in the understanding of animal viruses. Terms relevant to certain kinds of bacterial virulence factors, such as toxins and antibiotic resistance genes, are absent from the GO because the actions of these virulence factors are context dependent; whether something is a toxin, for example, depends on the host. Colosimo and Korves [17] describe additional controlled vocabularies and databases that are specialized for virulence factors and that complement what the GO is designed to encompass.

One of the most important functions that the GO can provide is a strong connection between reductionist experimental biology and high-throughput systems biology. These two approaches are complementary and, in the ideal case, highly synergistic. Reductionist experiments aim to produce highly detailed and reliable functional information about a limited number of genes by using a wide diversity of experimental technologies, some of which might be very labor-intensive. Systems biology aims to characterize entire sets of cellular components by using a limited set of very rapid but necessarily error-prone technologies, and then draw inferences about biological mechanisms based on the global properties of the data. Systems biology can identify emergent properties of biological systems that are not apparent from consideration of small sets of genes, no matter how well characterized. Simultaneously, these inferences are wholly dependent on detailed experimental information available about the functions of key genes in the dataset. The essential contribution of the GO is to translate reductionist information into a machine-readable format that is available for the computational algorithms used to interpret systems data. GO annotations are always associated with evidence codes such as IDA (inferred from direct assay), IGI (inferred from genetic interaction) or IEP (inferred from expression pattern) so that the provenance of the data can be incorporated into the computational analysis [10,14].
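As a rough illustration of how evidence codes let provenance enter a computational analysis, the sketch below stores each annotation as a small record and keeps only those backed by chosen codes. The three codes and their expansions are the ones named in the text; the `Annotation` record, the `with_evidence` helper and the gene names are hypothetical, and the record is far simpler than any real GO annotation file format.

```python
from dataclasses import dataclass

# Evidence codes named in the text, with their expansions.
EVIDENCE_CODES = {
    "IDA": "inferred from direct assay",
    "IGI": "inferred from genetic interaction",
    "IEP": "inferred from expression pattern",
}


@dataclass(frozen=True)
class Annotation:
    gene: str      # hypothetical gene name
    go_term: str   # GO accession
    evidence: str  # evidence code recording how the function was inferred


def with_evidence(annotations, accepted_codes):
    """Keep only annotations whose evidence code is in accepted_codes."""
    unknown = set(accepted_codes) - set(EVIDENCE_CODES)
    if unknown:
        raise ValueError(f"unrecognized evidence codes: {sorted(unknown)}")
    return [a for a in annotations if a.evidence in accepted_codes]


# Hypothetical annotations, for illustration only.
annotations = [
    Annotation("geneA", "GO:0051701", "IDA"),
    Annotation("geneB", "GO:0051701", "IEP"),
    Annotation("geneC", "GO:0044403", "IGI"),
]

direct_assay_only = with_evidence(annotations, {"IDA"})
print([a.gene for a in direct_assay_only])
# → ['geneA']
```

An analysis could equally weight annotations by code rather than discard them; filtering is simply the shortest way to show the provenance being used.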

A key bottleneck exists in the process of creating the connection that the GO provides between reductionist and systems biology data, however. Currently, GO terms and evidence codes are assigned to gene products manually by curators who read and synthesize sets of peer-reviewed papers describing the functional characterization of those genes, sometimes creating new GO terms where needed. This is necessarily a slow, labor-intensive process, and one which granting agencies are still slow to fund. I, therefore, conclude this editorial by calling for authors to include their own GO term assignments when publishing new experimental data about genes, so their findings are automatically machine readable. Journals should actively encourage or even mandate this practice, just as GenBank submissions are mandated for DNA sequence data. The quality of the GO term assignments could readily be assessed as part of the peer review process. I also call on granting agencies and their reviewer communities to actively support efforts to curate the vast tracts of experimental data that have been produced in the past and continue to be created. GO annotation also offers exciting and stimulating educational opportunities; if every undergraduate- and graduate-level journal club, term paper or research internship ended with the submission of peer-reviewable GO term assignments, the body of GO annotations would rapidly advance. In this context, social networking technologies and tools such as wikis offer exciting opportunities for community-level synthesis and quality control.

Acknowledgements

I thank the members of the PAMGO Consortium and the Gene Ontology Consortium for their collaboration in developing many GO terms and for much valuable discussion. I also thank Trudy Torto-Alalibo, Michelle Gwinn-Giglio and Candace Collmer for their critical reading of the manuscript and Emily Alberts for manuscript assistance. This work was supported by the National Research Initiative of the USDA Cooperative State Research, Education and Extension Service, grant number 2005-35600-16370, and by the US National Science Foundation, grant number EF-0523736.

References

1 Binnewies, T.T. et al. (2006) Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct. Integr. Genomics 6, 165–185

2 Guzman, E. et al. (2008) Completely sequenced genomes of pathogenic bacteria: a review. Enferm. Infecc. Microbiol. Clin. 26, 88–98

3 Aravind, L. et al. (1999) Conserved domains in DNA repair proteins and evolution of repair systems. Nucleic Acids Res. 27, 1223–1242

4 Boucher, Y. et al. (2003) Lateral gene transfer and the origins of prokaryotic groups. Annu. Rev. Genet. 37, 283–328

5 Keeling, P.J. and Palmer, J.D. (2008) Horizontal gene transfer in eukaryotic evolution. Nat. Rev. Genet. 9, 605–618

6 He, X. and Zhang, J. (2005) Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 169, 1157–1164

7 Rodriguez-Trelles, F. et al. (2003) Convergent neofunctionalization by positive Darwinian selection after ancient recurrent duplications of the xanthine dehydrogenase gene. Proc. Natl. Acad. Sci. U. S. A. 100, 13413–13417

8 Arnaud, M.B. et al. Gene Ontology and the annotation of pathogen genomes: the case of Candida albicans. Trends Microbiol

9 Irbis, C. and Ushida, K. (2004) Detection of methanogens and proteobacteria from a single cell of rumen ciliate protozoa. J. Gen. Appl. Microbiol. 50, 203–212

10 Harris, M.A. et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32, D258–D261

11 The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29

12 Lewis, S.E. (2005) Gene Ontology: looking backwards and forwards. Genome Biol. 6, 103

13 Christie, K.R. et al. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns. Trends Microbiol

14 Gwinn-Giglio, M. et al. Applying the Gene Ontology in microbial annotation. Trends Microbiol

15 Hu, J.C. et al. What we can learn about Escherichia coli through application of Gene Ontology. Trends Microbiol

16 Lindeberg, M. and Collmer, A. Gene Ontology for type III effectors: capturing processes at the host–pathogen interface. Trends Microbiol

17 Colosimo, M.E. and Korves, T. Controlled vocabularies for microbial virulence factors. Trends Microbiol

18 Chibucos, M.C. et al. Describing commonalities in microbial effector delivery using the Gene Ontology. Trends Microbiol

19 Torto-Alalibo, T. et al. Infection strategies of filamentous microbes described with the Gene Ontology. Trends Microbiol

20 McCarthy, F.M. et al. Understanding animal viruses using the Gene Ontology. Trends Microbiol

0966-842X/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.

doi:10.1016/j.tim.2009.05.002 Available online 3 July 2009
