Yeast 2006; 23: 905–912. Published online in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/yea.1419
Report
How to get the most from fission yeast genome data: a report from the 2006 European Fission Yeast Meeting Computing Workshop
Valerie Wood*
Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1HH, UK
*Correspondence to: Valerie Wood, Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1HH, UK. E-mail: [email protected]
Received: 24 August 2006
Accepted: 29 August 2006
Abstract
A fission yeast computing workshop, ‘How to get the most from the fission yeast genome data’, was run as a satellite to the European Fission Yeast Meeting. The broad aims of the workshop were to provide fission yeast bench biologists with a set of tools and protocols to query the fission yeast genome data in specific ways, in order to extract biologically meaningful information of interest, which can be tailored to the needs of individual research projects. A description of the workshop content is provided and a selection of the tools presented is reviewed. Copyright © 2006 John Wiley & Sons, Ltd.
Keywords: genome analysis; Gene Ontology; Pfam; Sz. pombe; annotation; curation; database; fission yeast; bioinformatics
Introduction
At present, the primary publicly available curated repositories relevant to fission yeast data are the Model Organism Database (MOD) Sz. pombe GeneDB (Hertz-Fowler et al., 2004) and the associated Gene Ontology (GO) data (The Gene Ontology Consortium, 2004); the UniProt Knowledgebase and ancillary databases (i.e. InterPro, IntAct) (Wu et al., 2006) and the Pfam protein family database (Finn et al., 2006). The fission yeast computing workshop was designed to demonstrate how these data resources are integrated via the MOD and to encourage their use by bench biologists, not only as static data repositories for simple text string or Accession No. retrieval-based queries, but also as data-mining tools for the identification of potentially interesting gene sets based on shared features and concepts. Particular emphasis was placed on:
1. Using GO for the retrieval of gene sets with shared known or inferred molecular function, biological process or cellular component (location or complex) at different levels of granularity of knowledge (depth of annotation).
2. Identifying statistically over-represented GO terms for gene sets of interest.
3. Identifying protein families, protein domains and potential orthologues, and their associated functional annotations and species distributions.
4. Performing combined queries based on these and other shared features and properties (protein size, chromosome, intron number, etc.).
5. Identifying functional clues for unstudied genes based on sequence similarity and other contextual information.
6. Downloading query results in different formats.
This feature provides a snapshot of a selection of the tools and exercises which were described in detail in the workshop. The complete workshop manual can be downloaded from http://www.sanger.ac.uk/Projects/S_pombe/documentation.shtml
GeneDB
GeneDB (http://www.genedb.org/genedb/pombe/) is a multi-species database which hosts sequence, annotation and curation for the organisms sequenced by the Wellcome Trust Sanger Institute Pathogen Sequencing Unit (PSU). The GeneDB module presented by Martin Aslett (Pathogen Sequencing Unit, Sanger Institute, UK) and Valerie Wood (Fission Yeast Functional Genomics Group, Sanger Institute, UK) provided a general overview of the annotation and features and demonstrated the analysis tools available to search, browse and download fission yeast genome data.
The tools presented included the ‘Boolean query interface’, which allows users to build comprehensive combined queries based on a number of annotated features and properties (http://www.genedb.org/gusapp/servlet?page=boolq&organism=pombe). The ‘Boolean query interface’ provides a powerful tool for the identification of candidate gene sets. Query sets can be selected and combined using ‘AND’ or ‘OR’ operators through a simple web interface. The illustrated example in Figure 1 shows the selection of all genes with introns combined with all genes with the GO component term nucleus and the Pfam protein family WD repeat. To perform this query (or a similar query), first select the operator ‘AND’ twice, then select the three queries, ‘Proteins containing a specific Pfam domain’, ‘Genes with a specific GO component’ and ‘Predicted genes with a range of exon number’. Select ‘Proceed to next step’; this will provide the range of options visible in Figure 1, which should be selected from the pull-down menus, or typed in the text box provided, before submitting the final query.
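The logic of such a combined query can be sketched in a few lines of Python. This is only an illustration of the ‘AND’/‘OR’ semantics, not the GeneDB implementation: the `Gene` record, the gene names and the annotations below are invented toy data, and a gene is assumed to contain at least one intron when it has two or more exons.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Minimal, hypothetical gene record (not the GeneDB schema)."""
    name: str
    exon_count: int
    go_components: set = field(default_factory=set)
    pfam_domains: set = field(default_factory=set)


# Invented toy records, for illustration only.
GENES = [
    Gene("geneA", 3, {"nucleus"}, {"WD40"}),
    Gene("geneB", 1, {"nucleus"}, {"WD40"}),     # single exon, so no intron
    Gene("geneC", 2, {"cytoplasm"}, {"WD40"}),   # not annotated to nucleus
    Gene("geneD", 4, {"nucleus"}, {"Pkinase"}),  # no WD-repeat domain
]


def has_pfam(domain):
    """Query: proteins containing a specific Pfam domain."""
    return lambda g: domain in g.pfam_domains


def has_go_component(term):
    """Query: genes with a specific GO component annotation."""
    return lambda g: term in g.go_components


def exon_range(lowest, highest):
    """Query: predicted genes with a range of exon number (inclusive)."""
    return lambda g: lowest <= g.exon_count <= highest


def query_and(genes, *queries):
    """Names of genes that satisfy every query (logical AND)."""
    return [g.name for g in genes if all(q(g) for q in queries)]


def query_or(genes, *queries):
    """Names of genes that satisfy at least one query (logical OR)."""
    return [g.name for g in genes if any(q(g) for q in queries)]


# WD repeat AND nucleus AND at least one intron (two or more exons).
hits = query_and(GENES, has_pfam("WD40"), has_go_component("nucleus"),
                 exon_range(2, 100))
print(hits)
```

Expressing each query as a small predicate keeps the combination step independent of the individual criteria, which mirrors the way the web interface separates the choice of operators from the choice of queries.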
The results from Boolean query searches can be downloaded individually or combined (added, subtracted or intersected) with other queries via the ‘Query history’ feature, which is accessible from the bottom of the results page (http://www.genedb.org/gusapp/servlet?page=history). Using the ‘Download’ feature from the ‘Query History’, the results of queries can be downloaded as gene lists, spreadsheets containing user-defined fields (gene names, products, coordinates) or as Fasta format sequence of the CDS, upstream or downstream regions.
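For readers who prefer to post-process downloaded gene lists locally, the three ways of combining queries correspond directly to set union, difference and intersection. The sketch below assumes two gene lists have already been downloaded; the identifiers, products and sequences are invented placeholders, and the output layouts are simple examples rather than the exact formats produced by GeneDB.

```python
import csv
import io

# Hypothetical results of two saved queries (sets of gene identifiers).
query_1 = {"geneA", "geneB", "geneC"}
query_2 = {"geneB", "geneC", "geneD"}

added = query_1 | query_2        # genes in either query
subtracted = query_1 - query_2   # genes in query 1 but not in query 2
intersected = query_1 & query_2  # genes in both queries

# Invented annotation table used to fill the exported fields.
ANNOTATION = {
    "geneA": {"product": "hypothetical protein A", "sequence": "ATGGCTTAA"},
    "geneB": {"product": "hypothetical protein B", "sequence": "ATGAAATGA"},
    "geneC": {"product": "hypothetical protein C", "sequence": "ATGCCCTAG"},
    "geneD": {"product": "hypothetical protein D", "sequence": "ATGTTTTAA"},
}


def to_tsv(gene_ids, fields=("product",)):
    """Tab-delimited table with a header row and user-chosen fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(("gene",) + tuple(fields))
    for gene_id in sorted(gene_ids):
        row = tuple(ANNOTATION[gene_id][f] for f in fields)
        writer.writerow((gene_id,) + row)
    return buffer.getvalue()


def to_fasta(gene_ids, width=60):
    """Fasta-format text, wrapping each sequence at `width` characters."""
    lines = []
    for gene_id in sorted(gene_ids):
        record = ANNOTATION[gene_id]
        lines.append(f">{gene_id} {record['product']}")
        sequence = record["sequence"]
        lines.extend(sequence[i:i + width]
                     for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


print(to_tsv(intersected))
print(to_fasta(subtracted))
```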
The Gene Ontology
The Gene Ontology (GO) consortium is a collaborative open source project to develop controlled vocabularies to provide consistent descriptions of gene products (The Gene Ontology Consortium, 2004; http://www.geneontology.org/). GO provides ontologies for three biological domains: molecular function, biological process and cellular component. GO terms are used to annotate gene products consistently both within and between organisms, to allow the manual and automated retrieval of groups of similarly annotated genes.
The GO module presented by Midori Harris (GO Editorial Office, European Bioinformatics Institute, UK) provided an overview of the GO project. This included how both the ontology and the annotations are implemented, structured and updated. The principles of GO and its application to fission yeast are described in more detail in Aslett and Wood (this issue). An annotation status update presented by Valerie Wood provided the current GO annotation coverage for fission yeast and showed how gene associations to GO terms are derived from the literature, sequence similarity searches and electronic mappings. The current status of the GO annotation for fission yeast is also provided in Aslett and Wood (this issue).
A description and demonstration of the AmiGO GO browser was provided by Jane Lomax (GO Editorial Office, European Bioinformatics Institute, UK), who described the basic principles for retrieving gene sets, based on shared annotation from both fission yeast and other organisms. The fission yeast GO data is available in AmiGO via the GO consortium website (http://www.godatabase.org/cgi-bin/amigo/go.cgi?) and the GeneDB implementation of AmiGO (http://www.genedb.org/amigo/perl/go.cgi?species_db=GeneDB_Spombe). The use of AmiGO for browsing, searching and retrieving GO terms was demonstrated.
Figure 1. This screenshot shows the selection of query options in the GeneDB ‘Boolean Query Interface’ (http://www.genedb.org/gusapp/servlet?page=boolq&organism=pombe) to retrieve all gene products with a Pfam WD-repeat domain and an annotation to ‘nucleus’ which contain at least one intron. To repeat this query (or perform a similar query), first select the operator ‘AND’ twice, then select the three queries ‘Proteins containing a specific Pfam domain’, ‘Genes with a specific GO component’ and ‘Predicted genes with a range of exon number’. Select ‘Proceed to next step’; this will provide the range of options visible in Figure 1, which should be selected from the pull-down menus, or typed in the text box provided, before submitting the final query
The AmiGO screenshot in Figure 2a shows the results of a search using the ‘Terms’ option from the AmiGO front page. This search identifies all GO terms containing a specific sub-string, in this example ‘glucan synthase’. The results list allows the selection of a specific term, ‘alpha-1,3-glucan synthase activity (GO:0047657)’. Clicking on the ‘tree icon’ at the left of the term in the results provides the ‘tree’ view of the GO term shown in Figure 2b. This view shows a browsable ‘GO tree’ of the term alpha-1,3-glucan synthase activity and all of its parents. The number of gene products annotated to a term (or any of its child terms) in the organism or database being searched is shown in parentheses next to the term name.
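The two operations described here, a sub-string search over term names and a count of gene products annotated to a term or any of its child terms, can be sketched as follows. The ontology fragment is a deliberately simplified toy: the `TOY:` identifiers, the parent links and the gene products are invented for illustration and do not reproduce the real GO structure or the AmiGO code.

```python
# Toy term table: identifier -> term name (identifiers are invented).
TERMS = {
    "TOY:1": "catalytic activity",
    "TOY:2": "transferase activity",
    "TOY:3": "alpha-1,3-glucan synthase activity",
    "TOY:4": "1,3-beta-glucan synthase activity",
}

# Simplified child -> parents links; the real ontology has more levels.
PARENTS = {
    "TOY:2": ["TOY:1"],
    "TOY:3": ["TOY:2"],
    "TOY:4": ["TOY:2"],
}

# Gene products annotated directly to each term (invented).
DIRECT = {
    "TOY:2": {"gp1"},
    "TOY:3": {"gp2", "gp3"},
    "TOY:4": {"gp3", "gp4"},
}


def search_terms(substring):
    """Identifiers of all terms whose name contains the sub-string."""
    needle = substring.lower()
    return sorted(t for t, name in TERMS.items() if needle in name.lower())


def ancestors(term_id):
    """All parents of a term, followed transitively up to the root."""
    seen = set()
    stack = list(PARENTS.get(term_id, []))
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS.get(term, []))
    return seen


def products_at_or_below(term_id):
    """Gene products annotated to the term or to any of its child terms."""
    products = set()
    for annotated_term, gene_products in DIRECT.items():
        if annotated_term == term_id or term_id in ancestors(annotated_term):
            products |= gene_products
    return products


for term_id in search_terms("glucan synthase"):
    print(term_id, TERMS[term_id], len(products_at_or_below(term_id)))
print("TOY:2", TERMS["TOY:2"], len(products_at_or_below("TOY:2")))
```

Collecting gene products into a set before counting means that a product annotated to two child terms (`gp3` in the toy data) contributes only once to the count shown for the parent.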
Another common application for genome-wide GO annotation is to identify statistically over-represented GO terms among groups of genes. The guided exercises included the use of the Onto-Express software to identify over-represented GO terms in user-defined gene lists, e.g. genes upregulated in a microarray experiment (Khatri et al., 2002; http://vortex.cs.wayne.edu/projects.htm).
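To make the idea of over-representation concrete, the sketch below computes a one-sided hypergeometric p-value for each GO term: the probability of seeing at least the observed number of annotated genes in a list of that size drawn at random from the background. This is a commonly used test and is offered here as an assumption; the statistical models actually applied by Onto-Express are not described in this report. The gene identifiers and the term name are invented, and no correction for multiple testing is included.

```python
from math import comb


def hypergeom_pvalue(population, annotated, drawn, observed):
    """P(X >= observed) for a hypergeometric draw without replacement.

    population: number of genes in the background
    annotated:  number of background genes annotated to the term
    drawn:      number of genes in the user-defined list
    observed:   number of listed genes annotated to the term
    """
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, drawn)


def enrichment(gene_list, annotations, background):
    """Map each term to (overlap size, p-value) for the gene list."""
    genes = set(gene_list) & set(background)
    results = {}
    for term, annotated_genes in annotations.items():
        annotated_in_background = set(annotated_genes) & set(background)
        overlap = len(genes & annotated_in_background)
        results[term] = (
            overlap,
            hypergeom_pvalue(len(background), len(annotated_in_background),
                             len(genes), overlap),
        )
    return results


# Invented toy data: 20 background genes, 5 of them annotated to one term.
background = {f"g{i}" for i in range(20)}
annotations = {"toy term": {"g0", "g1", "g2", "g3", "g4"}}
upregulated = {"g0", "g1", "g2", "g10"}

results = enrichment(upregulated, annotations, background)
for term, (overlap, p_value) in results.items():
    print(term, overlap, round(p_value, 4))
```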
The Pfam protein family database
Figure 2. A ‘term’ search in the AmiGO GO browser. 1, The AmiGO screenshot in Figure 2a shows the results of a search using the ‘Terms’ option from the AmiGO front page. 2, This search identifies all GO terms containing a specific sub-string, in this example a search for ‘glucan synthase’. 3, This allows the selection of the more granular term ‘alpha-1,3-glucan synthase’ from the results list. Clicking on the ‘tree icon’ at the left of the term in the results provides the tree view of the GO term shown in Figure 2b. 4, This view shows a browsable GO tree of the term and all of its parents. 5, The number of gene products annotated to a term (or any of its child terms) in the organism or database being searched is shown in parentheses. 6, Terms can be expanded to show additional child terms by clicking on the ‘+’ icon
The Pfam protein family database provides a comprehensive resource for the identification of protein families and domains (Finn et al., 2006; http://www.sanger.ac.uk/Software/Pfam/). The Pfam database comprises Pfam-A, a set of high-quality curated alignments, and Pfam-B, a set of automatically generated alignments. The Pfam module was presented by Rob Finn (Pfam, Sanger Institute, UK) and included a description of the Pfam protein family pages, the current Pfam coverage and an introduction to a newer Pfam resource, Pfam Clans.
Figure 3. Pfam domain organization for Lid2 and the JmjC Pfam entry. Links to Pfam are provided from the ‘Domain Information’ section of every GeneDB gene page. Figure 3a shows the graphical view of the Pfam domain organization for the protein Lid2. 1. The meaning of the different icons is described in the key; 2. Clicking on the ‘domain icon’ for ‘JmjC’ will take you to the full entry for this family (PF02373) shown in Figure 3b. Individual Pfam protein family entries provide access to: 3. the protein alignment for the family; 4. a graphical view of all proteins with a given domain or combination of domains (domain architecture); 5. a view of the species distribution for a given domain or family; 6. the phylogenetic tree. These entry points are illustrated in Figure 3b
Individual Pfam protein family entries provide access to the protein alignment for the family; a graphical view of all proteins with a given domain or combination of domains (domain architecture); a view of the species distribution for a given domain or family and the phylogenetic tree. These entry points are illustrated in Figure 3, which shows the Pfam graphical view of the Lid2 protein and the Pfam entry for ‘JmjC’ (Accession No. PF02373), which is one of the domains present in Lid2. Pfam views of individual proteins can be accessed from the ‘view Pfam domain structure for this gene product’ link in the ‘Domain Information’ section of every GeneDB gene page. For Lid2, this link provides the graphical view of the Pfam domain organization for the protein (Figure 3a). Clicking on the ‘domain icon’ for ‘JmjC’ will take you to the full Pfam entry for this family (Figure 3b).
Fission yeast has a Pfam-A protein family coverage of 77%, compared to 72% for budding yeast. This coverage is higher than for any other fully represented eukaryotic proteome and is partly due to the greater conservation with higher eukaryotes and the apparently lower number of species-specific proteins. This extensive coverage makes the Pfam-A features illustrated here available for the vast majority (3838) of fission yeast proteins. Even in the absence of a Pfam-A domain, useful information including the presence of conserved regions (identified by the Pfam-B uncurated alignments), low-complexity regions, coiled-coil regions and signal peptides can be accessed via Pfam image maps.
The on-line tutorial provided by Rob Finn includes exercises to query the proteome based on domain combinations and to compare proteomes of different species to identify common and unique domains. A simple XML interface to produce publication quality Pfam graphics is also demonstrated. The Pfam tutorial can be accessed at http://www.sanger.ac.uk/Users/rdf/PombeTutorial/
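The two kinds of exercise mentioned above can be illustrated with plain Python sets, assuming domain assignments have already been obtained for each proteome. The protein identifiers and their domain content below are invented toy data (the domain names are used only as labels), and the functions are hypothetical helpers, not part of the Pfam website or its tutorial.

```python
# Hypothetical domain assignments: protein -> ordered list of Pfam domains.
PROTEOME_A = {
    "a1": ["PHD", "JmjC"],
    "a2": ["WD40"],
    "a3": ["JmjC"],
}
PROTEOME_B = {
    "b1": ["WD40", "WD40"],
    "b2": ["Pkinase"],
}


def proteins_with_all(proteome, required_domains):
    """Proteins that contain every domain in the requested combination."""
    required = set(required_domains)
    return sorted(p for p, domains in proteome.items()
                  if required <= set(domains))


def domain_set(proteome):
    """All distinct domains found anywhere in a proteome."""
    return {d for domains in proteome.values() for d in domains}


# Query by domain combination.
phd_and_jmjc = proteins_with_all(PROTEOME_A, ["PHD", "JmjC"])

# Compare two proteomes.
common = domain_set(PROTEOME_A) & domain_set(PROTEOME_B)
unique_to_a = domain_set(PROTEOME_A) - domain_set(PROTEOME_B)
unique_to_b = domain_set(PROTEOME_B) - domain_set(PROTEOME_A)

print(phd_and_jmjc, sorted(common), sorted(unique_to_a), sorted(unique_to_b))
```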
YOGY
It is common practice to obtain functional clues for proteins of interest by detecting homologous proteins in other organisms and identifying their roles. However, the accurate identification of orthologues is not trivial, especially when levels of sequence similarity are low. Jurg Bahler (Fission Yeast Functional Genomics Group, Sanger Institute, UK) presented a module which included a description of YOGY, a web-based resource for retrieving orthologues from fission yeast and other fully sequenced genomes (Penkett et al., 2006; http://www.sanger.ac.uk/PostGenomics/S_pombe/YOGY/). Using a gene or protein identifier as a query, this resource can be used to retrieve orthologues, using the major orthologue predictors (KOGs, Inparanoid, HomoloGene and OrthoMCL) and a manually curated inventory of known or predicted orthologues between Sz. pombe and S. cerevisiae to provide a comprehensive list of orthologue candidates for further inspection (Tatusov et al., 2003; O’Brien et al., 2005; Wheeler et al., 2005; Li et al., 2003; Wood, 2006). Manually curated GO terms associated with the predicted orthologues can also be retrieved for functional inference. The example in Figure 4 shows the Inparanoid orthologues for the Sz. pombe gene product ‘metaxin’ and the GO terms which have been associated with any of the annotated orthologues. Links to YOGY are provided from the ‘Database Cross-References’ section of every Sz. pombe GeneDB gene page.
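The idea of pooling several orthologue predictors into one candidate list, and then collecting the GO terms of those candidates, can be sketched as follows. This is not how YOGY is implemented: the prediction tables, gene identifiers and GO term labels are invented, and ordering candidates by the number of supporting sources is an illustrative choice made here, not a feature described in the report.

```python
from collections import defaultdict

# Invented prediction tables: source -> {query gene -> predicted orthologues}.
PREDICTIONS = {
    "KOGs": {"geneX": {"orthA", "orthB"}},
    "Inparanoid": {"geneX": {"orthA"}},
    "HomoloGene": {"geneX": {"orthA", "orthC"}},
    "OrthoMCL": {},
    "curated": {"geneX": {"orthA"}},
}

# Invented manually curated GO terms for the candidate orthologues.
GO_TERMS = {
    "orthA": {"toy term 1"},
    "orthB": {"toy term 2"},
}


def candidate_orthologues(gene_id):
    """List of (orthologue, supporting sources), best supported first."""
    support = defaultdict(set)
    for source, table in PREDICTIONS.items():
        for orthologue in table.get(gene_id, ()):
            support[orthologue].add(source)
    # Sort by descending number of sources, then by name for stable output.
    return sorted(support.items(), key=lambda item: (-len(item[1]), item[0]))


def inferred_go_terms(gene_id):
    """GO terms attached to any candidate orthologue of the query gene."""
    terms = set()
    for orthologue, _sources in candidate_orthologues(gene_id):
        terms |= GO_TERMS.get(orthologue, set())
    return terms


candidates = candidate_orthologues("geneX")
for orthologue, sources in candidates:
    print(orthologue, sorted(sources))
print(sorted(inferred_go_terms("geneX")))
```

Keeping the supporting sources alongside each candidate preserves the information needed for the ‘further inspection’ step, since a candidate proposed by a single predictor deserves more scrutiny than one found by several.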
Additional topics
Additional topics included the UniProt universal protein resource, the IntAct protein interaction database and InterPro (an integrated protein family database), presented by Viv Junker (UniProtKB, Swiss Institute of Bioinformatics, Switzerland; http://www.ebi.uniprot.org/index.shtml). Jurg Bahler also provided a demonstration of the Proteome database from Biobase (PombePD), tools and resources for the mining of microarray data, a comparative analysis of microarray studies (Marguerat et al., 2006) and a program for fission yeast PCR primer design (Penkett et al., this issue). Finally, Bobby-Joe Breitkreutz and Lorrie Boucher (The BioGRID, University of Toronto, Canada) provided a preliminary analysis and a progress update of the consortium physical and genetic interaction literature curation effort for fission yeast (http://www.thebiogrid.org/).
Future prospects
Feedback indicated that the majority of the participants found the workshop ‘very relevant’ to their research. The content of future workshops will be extended and developed to cope with the expanding computing needs and community expectations. Relevant workshop content will also form a module in a new Wellcome Trust Advanced Course ‘Genome-wide Approaches with Fission Yeast’, organized by Jurg Bahler, Juan Mata and Valerie Wood. This practical course will take place at the Sanger Institute in October 2007 and should enable participants to acquire the necessary skills to pursue large-scale and functional genomics approaches on a routine basis in their own laboratories (for details, see http://www.wellcome.ac.uk/advancedcourses).
Figure 4. The Inparanoid results for Sz. pombe SPAC589.04 (Metaxin) in YOGY. YOGY graphically displays the phylogenetic distribution and copy number of orthologues. The GO terms manually associated with the gene product and its predicted orthologues are also retrieved
Acknowledgements
The author would like to thank Martin Aslett, Jurg Bahler, Rob Finn, Midori Harris, Viv Junker and Jane Lomax for preparing the course and the manual; Andy Giddings for reprographics support; and Adrian Tivey, Paul Mooney, Chris Penkett and Katja Kivinen for technical and teaching support. Additional thanks to Martin Aslett and Jurg Bahler for proofreading comments.
References
Apweiler R, Bairoch A, Wu CH, et al. 2004. UniProt: the universal protein knowledgebase. Nucleic Acids Res 32: D115–119.
Ashburner M, Ball CA, Blake JA, et al. 2000. Gene Ontology: tool for the unification of biology. Nature Genet 25: 25–29.
Aslett M, Wood V. 2006. Gene Ontology annotation status of the fission yeast genome: preliminary coverage approaches 100%. Yeast 23: (this issue).
Finn RD, Mistry J, Schuster-Bockler B, et al. 2006. Pfam: clans, web tools and services. Nucleic Acids Res 34: D247–251.
Hertz-Fowler C, Peacock CS, Wood V, et al. 2004. GeneDB: a resource for prokaryotic and eukaryotic organisms. Nucleic Acids Res 32: D339–343.
Khatri P, Draghici S, Ostermeier GC, Krawetz SA. 2002. Profiling gene expression using Onto-Express. Genomics 79: 266–270.
Li L, Stoeckert CJ Jr, Roos DS. 2003. OrthoMCL: identification of orthologue groups for eukaryotic genomes. Genome Res 13: 2178–2189.
Marguerat S, Jensen TS, de Lichtenberg U, et al. 2006. The more the merrier: comparative analysis of microarray studies on cell cycle-regulated genes in fission yeast. Yeast 23: 261–277.
O’Brien KP, Remm M, Sonnhammer EL. 2005. Inparanoid: a comprehensive database of eukaryotic orthologues. Nucleic Acids Res 33: D476–D480.
Penkett CJ, Morris JA, Wood V, Bahler J. 2006. YOGY: a web-based, integrated database to retrieve protein and associated Gene Ontology terms. Nucleic Acids Res 34: W330–334.
Penkett CJ, Birtle ZE, Bahler J. 2006. Simplified primer design for PCR-based gene targeting and microarray primer database: two web tools for fission yeast. Yeast 23: (this issue).
The Gene Ontology Consortium. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32: D258–261.
Tatusov RL, Fedorova ND, Jackson JD, et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4: 41.
Wood V. 2006. Schizosaccharomyces pombe comparative genomics; from sequence to systems. In Comparative Genomics using Fungi as Models, Sunnerhagen P, Piskur J (eds). Springer-Verlag: Heidelberg.
Wheeler DL, Barrett T, Benson DA, et al. 2005. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 33: D39–D45.
Wu CH, Apweiler R, Bairoch A, et al. 2006. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34: D187–191.