

Yeast

Yeast 2006; 23: 905–912.

Published online in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/yea.1419

Report

How to get the most from fission yeast genome data: a report from the 2006 European Fission Yeast Meeting Computing Workshop

Valerie Wood*

Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1HH, UK

*Correspondence to: Valerie Wood, Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1HH, UK. E-mail: [email protected]

Received: 24 August 2006

Accepted: 29 August 2006

Abstract

A fission yeast computing workshop ‘How to get the most from the fission yeast genome data’ was run as a satellite to the European Fission Yeast Meeting. The broad aims of the workshop were to provide fission yeast bench biologists with a set of tools and protocols to query the fission yeast genome data in specific ways, in order to extract biologically meaningful information of interest, which can be tailored to the needs of individual research projects. A description of the workshop content is provided and a selection of the tools presented is reviewed. Copyright 2006 John Wiley & Sons, Ltd.

Keywords: genome analysis; Gene Ontology; Pfam; Sz. pombe; annotation; curation; database; fission yeast; bioinformatics

Introduction

At present, the primary publicly available curated repositories relevant to fission yeast data are the Model Organism Database (MOD) Sz. pombe GeneDB (Hertz-Fowler et al., 2004) and the associated Gene Ontology (GO) data (The Gene Ontology Consortium, 2004); the UniProt Knowledgebase and ancillary databases (i.e. InterPro, IntAct) (Wu et al., 2006) and the Pfam protein family database (Finn et al., 2006). The fission yeast computing workshop was designed to demonstrate how these data resources are integrated via the MOD and to encourage their use by bench biologists, not only as static data repositories for simple text string or Accession No. retrieval-based queries, but also as data-mining tools for the identification of potentially interesting gene sets based on shared features and concepts. Particular emphasis was placed on:

1. Using GO for the retrieval of gene sets with shared known or inferred molecular function, biological process or cellular component (location or complex) at different levels of granularity of knowledge (depth of annotation).

2. Identifying statistically over-represented GO terms for gene sets of interest.

3. Identifying protein families, protein domains and potential orthologues, and their associated functional annotations and species distributions.

4. Performing combined queries based on these and other shared features and properties (protein size, chromosome, intron number, etc.).

5. Identifying functional clues for unstudied genes based on sequence similarity and other contextual information.

6. Downloading query results in different formats.

This feature provides a snapshot of a selection of the tools and exercises which were described in detail in the workshop. The complete workshop manual can be downloaded from http://www.sanger.ac.uk/Projects/S_pombe/documentation.shtml


GeneDB

GeneDB (http://www.genedb.org/genedb/pombe/) is a multi-species database which hosts sequence, annotation and curation for the organisms sequenced by the Wellcome Trust Sanger Institute Pathogen Sequencing Unit (PSU). The GeneDB module presented by Martin Aslett (Pathogen Sequencing Unit, Sanger Institute, UK) and Valerie Wood (Fission Yeast Functional Genomics Group, Sanger Institute, UK) provided a general overview of the annotation and features and demonstrated the analysis tools available to search, browse and download fission yeast genome data.

The tools presented included the ‘Boolean query interface’, which allows users to build comprehensive combined queries based on a number of annotated features and properties (http://www.genedb.org/gusapp/servlet?page=boolq&organism=pombe). The ‘Boolean query interface’ provides a powerful tool for the identification of candidate gene sets. Query sets can be selected and combined using ‘AND’ or ‘OR’ operators through a simple web interface. The illustrated example in Figure 1 shows the selection of all genes with introns combined with all genes with the GO component term nucleus and the Pfam protein family WD repeat. To perform this query (or a similar query), first select the operator ‘AND’ twice, then select the three queries, ‘Proteins containing a specific Pfam domain’, ‘Genes with a specific GO component’ and ‘Predicted genes with a range of exon number’. Select ‘Proceed to next step’; this will provide the range of options visible in Figure 1, which should be selected from the pull-down menus, or typed in the text box provided, before submitting the final query.
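The logic of the combined query can be sketched outside the web interface as plain set operations. The following is a minimal illustration in Python, not the GeneDB implementation: the gene identifiers, the three input sets and the helper names `genes_in_exon_range` and `combine` are all hypothetical.

```python
# Hypothetical inputs: gene identifiers and annotations are placeholders,
# not real GeneDB query results.
pfam_wd_repeat = {"gene_a", "gene_b", "gene_c", "gene_e"}  # proteins with a WD-repeat domain
go_nucleus = {"gene_a", "gene_b", "gene_d", "gene_e"}      # genes annotated to 'nucleus'
exon_counts = {"gene_a": 3, "gene_b": 1, "gene_c": 2, "gene_d": 4, "gene_e": 2}


def genes_in_exon_range(exon_counts, minimum, maximum=None):
    """Return the genes whose exon number lies in [minimum, maximum]."""
    return {
        gene for gene, n in exon_counts.items()
        if n >= minimum and (maximum is None or n <= maximum)
    }


def combine(operator, *gene_sets):
    """Combine gene sets with 'AND' (intersection) or 'OR' (union)."""
    if operator == "AND":
        return set.intersection(*gene_sets)
    if operator == "OR":
        return set.union(*gene_sets)
    raise ValueError(f"unsupported operator: {operator}")


# A gene with at least one intron has two or more exons.
with_introns = genes_in_exon_range(exon_counts, minimum=2)
candidates = combine("AND", pfam_wd_repeat, go_nucleus, with_introns)
print(sorted(candidates))
```

Selecting ‘AND’ twice in the interface corresponds to intersecting three sets here; replacing the operator with ‘OR’ would return every gene that satisfies at least one of the criteria.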

The results from Boolean query searches can be downloaded individually or combined (added, subtracted or intersected) with other queries via the ‘Query history’ feature, which is accessible from the bottom of the results page (http://www.genedb.org/gusapp/servlet?page=history). Using the ‘Download’ feature from the ‘Query History’, the results of queries can be downloaded as gene lists, spreadsheets containing user-defined fields (gene names, products, coordinates) or as FASTA-format sequence of the CDS, upstream or downstream regions.
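To make the three download formats concrete, the sketch below writes the same small result set as a gene list, as a tab-delimited table with user-chosen fields, and as FASTA. It is only an illustration of the formats: the records, field names and function names are hypothetical and do not reflect the files GeneDB actually produces.

```python
import csv
import io

# Hypothetical query results; names, products and sequences are placeholders.
results = [
    {"gene": "gene_a", "product": "hypothetical protein A", "cds": "ATGGCTTAA"},
    {"gene": "gene_b", "product": "hypothetical protein B", "cds": "ATGAAACCCGGGTAG"},
]


def to_gene_list(records):
    """One gene name per line."""
    return "\n".join(r["gene"] for r in records)


def to_spreadsheet(records, fields):
    """Tab-delimited table restricted to the user-defined fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fields, delimiter="\t", extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_fasta(records, width=60):
    """FASTA text: a '>' header line per record, then the wrapped sequence."""
    lines = []
    for r in records:
        lines.append(f">{r['gene']} {r['product']}")
        seq = r["cds"]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


print(to_gene_list(results))
print(to_spreadsheet(results, ["gene", "product"]))
print(to_fasta(results, width=10))
```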

The Gene Ontology

The Gene Ontology (GO) consortium is a collaborative open source project to develop controlled vocabularies to provide consistent descriptions of gene products (The Gene Ontology Consortium, 2004; http://www.geneontology.org/). GO provides ontologies for three biological domains: molecular function, biological process and cellular component. GO terms are used to annotate gene products consistently both within and between organisms, to allow the manual and automated retrieval of groups of similarly annotated genes.

The GO module presented by Midori Harris (GO Editorial Office, European Bioinformatics Institute, UK) provided an overview of the GO project. This included how both the ontology and the annotations are implemented, structured and updated. The principles of GO and its application to fission yeast are described in more detail in Aslett and Wood (this issue). An annotation status update presented by Valerie Wood provided the current GO annotation coverage for fission yeast and showed how gene associations to GO terms are derived from the literature, sequence similarity searches and electronic mappings. The current status of the GO annotation for fission yeast is also provided in Aslett and Wood (this issue).

A description and demonstration of the AmiGO GO browser was provided by Jane Lomax (GO Editorial Office, European Bioinformatics Institute, UK), who described the basic principles for retrieving gene sets, based on shared annotation from both fission yeast and other organisms. The fission yeast GO data is available in AmiGO via the GO consortium website (http://www.godatabase.org/cgi-bin/amigo/go.cgi?) and the GeneDB implementation of AmiGO (http://www.genedb.org/amigo/perl/go.cgi?species_db=GeneDB_Spombe). The use of AmiGO for browsing, searching and retrieving GO terms was demonstrated.

Figure 1. This screen shot shows the selection of query options in the GeneDB ‘Boolean Query Interface’ (http://www.genedb.org/gusapp/servlet?page=boolq&organism=pombe) to retrieve all gene products with a Pfam WD-repeat domain and an annotation to ‘nucleus’ which contain at least one intron. To repeat this query (or perform a similar query), first select the operator ‘AND’ twice, then select the three queries ‘Proteins containing a specific Pfam domain’, ‘Genes with a specific GO component’ and ‘Predicted genes with a range of exon number’. Select ‘Proceed to next step’; this will provide the range of options visible in Figure 1, which should be selected from the pull-down menus, or typed in the text box provided, before submitting the final query

The AmiGO screenshot in Figure 2a shows the results of a search using the ‘Terms’ option from the AmiGO front page. This search identifies all GO terms containing a specific sub-string, in this example a search for ‘glucan synthase’. This provides a results list which allows the selection of a specific term ‘alpha-1,3-glucan synthase activity (GO:0047657)’. Clicking on the ‘tree icon’ at the left of the term in the results provides the ‘tree’ view of the GO term shown in Figure 2b. This view shows a browsable ‘GO tree’ of the term alpha-1,3-glucan synthase activity and all of its parents. The number of gene products annotated to a term (or any of its child terms) in the organism or database being searched is shown in parentheses next to the term name.
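The counts shown in parentheses include gene products annotated to any child term, so each gene product is counted once for a term and for every ancestor of that term. The sketch below illustrates this propagation on a toy ontology; the term names, the parent links and the gene names are all hypothetical and do not reproduce the real GO graph or the AmiGO code.

```python
from collections import defaultdict

# Toy ontology: each term maps to its direct parents (a term may have several).
parents = {
    "root": [],
    "term_a": ["root"],
    "term_b": ["root"],
    "term_c": ["term_a", "term_b"],  # two parents: the ontology is a graph, not a strict tree
}

# Hypothetical direct annotations: term -> gene products annotated to it.
direct_annotations = {
    "term_a": {"gene1"},
    "term_b": {"gene2"},
    "term_c": {"gene1", "gene3"},
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagated_counts(parents, direct_annotations):
    """Count distinct gene products annotated to each term or any of its children."""
    genes_by_term = defaultdict(set)
    for term, genes in direct_annotations.items():
        for target in {term} | ancestors(term, parents):
            genes_by_term[target] |= genes
    return {term: len(genes_by_term[term]) for term in parents}


counts = propagated_counts(parents, direct_annotations)
for term in parents:
    print(f"{term} ({counts[term]})")
```

Collecting gene products in sets rather than summing child counts avoids counting a gene twice when it is annotated to both a term and one of its children, or when a child has more than one parent.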

Another common application for genome-wide GO annotation is to identify statistically over-represented GO terms among groups of genes. The guided exercises included the use of the Onto-Express software to identify over-represented GO terms in user-defined gene lists (e.g. genes upregulated in a microarray experiment; Khatri et al., 2002; http://vortex.cs.wayne.edu/projects.htm).
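One common way to score over-representation is a one-sided hypergeometric test: given how many genes in the genome carry a term, how likely is it to see at least the observed number of annotated genes in a list of a given size drawn at random? The sketch below is a generic illustration of that test under these assumptions; it is not a description of how Onto-Express computes its statistics, and the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 5000 genes in the genome, 100 annotated to the term,
# and a list of 50 upregulated genes of which 8 carry the term.
p_value = hypergeom_enrichment_p(population=5000, annotated=100, sample=50, hits=8)
print(f"p = {p_value:.3g}")
```

When many GO terms are tested for the same gene list, the resulting p-values would normally need a multiple-testing correction before being interpreted.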

The Pfam protein family database

Figure 2. A ‘term’ search in the AmiGO GO browser. 1, The AmiGO screenshot in Figure 2a shows the results of a search using the ‘Terms’ option from the AmiGO front page. 2, This search identifies all GO terms containing a specific sub-string, in this example a search for ‘glucan synthase’. 3, This allows the selection of the more granular term ‘alpha-1,3-glucan synthase’ from the results list. Clicking on the ‘tree icon’ at the left of the term in the results provides the tree view of the GO term shown in Figure 2b. 4, This view shows a browsable GO tree of the term and all of its parents. 5, The number of gene products annotated to a term (or any of its child terms) in the organism or database being searched is shown in parentheses. 6, Terms can be expanded to show additional child terms by clicking on the ‘+’ icon

The Pfam protein family database provides a comprehensive resource for the identification of protein families and domains (Finn et al., 2006; http://www.sanger.ac.uk/Software/Pfam/). The Pfam database comprises Pfam-A, a set of high-quality curated alignments, and Pfam-B, a set of automatically generated alignments. The Pfam module was presented by Rob Finn (Pfam, Sanger Institute, UK) and included a description of the Pfam protein family pages, the current Pfam coverage and an introduction to a newer Pfam resource, Pfam Clans.

Individual Pfam protein family entries provide access to the protein alignment for the family; a graphical view of all proteins with a given domain or combination of domains (domain architecture); a view of the species distribution for a given domain or family and the phylogenetic tree. These entry points are illustrated in Figure 3, which shows the Pfam graphical view of the Lid2 protein and the Pfam entry for ‘JmjC’ (Accession No. PF02373), which is one of the domains present in Lid2. Pfam views of individual proteins can be accessed from the ‘view Pfam domain structure for this gene product’ link in the ‘Domain Information’ section of every GeneDB gene page. For Lid2, this link provides the graphical view of the Pfam domain organization for the protein (Figure 3a). Clicking on the ‘domain icon’ for ‘JmjC’ will take you to the full Pfam entry for this family (Figure 3b).

Figure 3. Pfam domain organization for Lid2 and the JmjC Pfam entry. Links to Pfam are provided from the ‘Domain Information’ section of every GeneDB gene page. Figure 3a shows the graphical view of the Pfam domain organization for the protein Lid2. 1. The meaning of the different icons is described in the key. 2. Clicking on the ‘domain icon’ for ‘JmjC’ will take you to the full entry for this family (PF02373) shown in Figure 3b. Individual Pfam protein family entries provide access to: 3. the protein alignment for the family; 4. a graphical view of all proteins with a given domain or combination of domains (domain architecture); 5. a view of the species distribution for a given domain or family; 6. the phylogenetic tree. These entry points are illustrated in Figure 3b

Fission yeast has a Pfam-A protein family coverage of 77% compared to 72% for budding yeast. This coverage is higher than for any other fully represented eukaryotic proteome and is partly due to the greater conservation with higher eukaryotes and the apparent lower number of species-specific proteins. This extensive coverage makes the Pfam-A features illustrated here available for the vast majority (3838) of fission yeast proteins. Even in the absence of a Pfam-A domain, useful information including the presence of conserved regions (identified by the Pfam-B uncurated alignments), low-complexity regions, coiled-coil regions and signal peptides can be accessed via Pfam image maps.

The on-line tutorial provided by Rob Finn includes exercises to query the proteome based on domain combinations and to compare proteomes of different species to identify common and unique domains. A simple XML interface to produce publication quality Pfam graphics is also demonstrated. The Pfam tutorial can be accessed at http://www.sanger.ac.uk/Users/rdf/PombeTutorial/
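The two kinds of query in these exercises can be illustrated with a small sketch: selecting proteins that carry a given combination of domains, and comparing the domain content of two proteomes. This is only an illustration of the idea, not the Pfam interface; the protein names are placeholders, the domain assignments are invented for the example, and the function names are hypothetical.

```python
# Hypothetical proteomes: protein names are placeholders and the domain
# assignments are illustrative, not taken from Pfam.
proteome_a = {
    "protein_a1": ["PHD", "JmjC"],
    "protein_a2": ["WD40"],
    "protein_a3": ["JmjC"],
}
proteome_b = {
    "protein_b1": ["WD40", "Pkinase"],
    "protein_b2": ["PHD"],
}


def proteins_with_domains(proteome, required):
    """Return proteins whose domain set contains every required domain."""
    required = set(required)
    return sorted(
        name for name, domains in proteome.items() if required <= set(domains)
    )


def compare_domains(first, second):
    """Split the domains of two proteomes into common and unique sets."""
    domains_first = {d for domains in first.values() for d in domains}
    domains_second = {d for domains in second.values() for d in domains}
    return {
        "common": sorted(domains_first & domains_second),
        "only_first": sorted(domains_first - domains_second),
        "only_second": sorted(domains_second - domains_first),
    }


print(proteins_with_domains(proteome_a, ["PHD", "JmjC"]))
print(compare_domains(proteome_a, proteome_b))
```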

YOGY

It is common practice to obtain functional clues for proteins of interest by detecting homologous proteins in other organisms and identifying their roles. However, the accurate identification of orthologues is not trivial, especially when levels of sequence similarity are low. Jurg Bahler (Fission Yeast Functional Genomics Group, Sanger Institute, UK) presented a module which included a description of YOGY, a web-based resource for retrieving orthologues from fission yeast and other fully sequenced genomes (Penkett et al., 2006; http://www.sanger.ac.uk/PostGenomics/S_pombe/YOGY/). Using a gene or protein identifier as a query, this resource can be used to retrieve orthologues, using the major orthologue predictors (KOGs, Inparanoid, HomoloGene and OrthoMCL) and a manually curated inventory of known or predicted orthologues between Sz. pombe and S. cerevisiae to provide a comprehensive list of orthologue candidates for further inspection (Tatusov et al., 2003; O’Brien et al., 2005; Wheeler et al., 2005; Li et al., 2003; Wood, 2006). Manually curated GO terms associated with the predicted orthologues can also be retrieved for functional inference. The example in Figure 4 shows the Inparanoid orthologues for the Sz. pombe gene product ‘metaxin’ and the GO terms which have been associated with any of the annotated orthologues. Links to YOGY are provided from the ‘Database Cross-References’ section of every Sz. pombe GeneDB gene page.
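Pooling predictions from several sources into one candidate list can be sketched as follows. This is a simplified illustration and not YOGY's implementation: the candidate identifiers are placeholders, the per-source predictions are invented, and ranking by the number of supporting sources is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical predictions for one query protein; candidate identifiers
# are placeholders, not real database entries.
predictions = {
    "KOGs": {"orth_1", "orth_2"},
    "Inparanoid": {"orth_1"},
    "HomoloGene": {"orth_1", "orth_3"},
    "OrthoMCL": {"orth_2"},
    "curated": {"orth_1"},
}


def merge_candidates(predictions):
    """Return (candidate, supporting sources) pairs, best supported first."""
    support = defaultdict(list)
    for source in sorted(predictions):
        for candidate in predictions[source]:
            support[candidate].append(source)
    # Sort by descending number of sources, then by candidate name.
    return sorted(support.items(), key=lambda item: (-len(item[1]), item[0]))


merged = merge_candidates(predictions)
for candidate, sources in merged:
    print(f"{candidate}: {len(sources)} source(s): {', '.join(sources)}")
```

Keeping the list of supporting sources alongside each candidate preserves the information needed for the further inspection that the text recommends, since predictors can disagree when sequence similarity is low.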

Additional topics

Additional topics included the UniProt universal protein resource, the IntAct protein interaction database and InterPro (integrated protein family database), presented by Viv Junker (UniProtKB, Swiss Institute of Bioinformatics, Switzerland; http://www.ebi.uniprot.org/index.shtml). Jurg Bahler also provided a demonstration of the Proteome database from Biobase (PombePD), tools and resources for the mining of microarray data, a comparative analysis of microarray studies (Marguerat et al., 2006) and a program for fission yeast PCR primer design (Penkett et al., this issue). Finally, Bobby-Joe Breitkreutz and Lorrie Boucher (The BioGRID, University of Toronto, Canada) provided a preliminary analysis and a progress update of the consortium physical and genetic interaction literature curation effort for the fission yeast (http://www.thebiogrid.org/).

Future prospects

Feedback indicated that the majority of the participants found the workshop ‘very relevant’ to their research. The content of future workshops will be extended and developed to cope with the expanding computing needs and community expectations. Relevant workshop content will also form a module in a new Wellcome Trust Advanced Course ‘Genome-wide Approaches with Fission Yeast’, organized by Jurg Bahler, Juan Mata and Valerie Wood. This practical course will take place at the Sanger Institute in October 2007 and should enable participants to acquire the necessary skills to pursue large-scale and functional genomics approaches on a routine basis in their own laboratories (for details, see http://www.wellcome.ac.uk/advancedcourses).

Figure 4. The Inparanoid results for Sz. pombe SPAC589.04 (Metaxin) in YOGY. YOGY graphically displays the phylogenetic distribution and copy number of orthologues. The GO terms manually associated with the gene product and its predicted orthologues are also retrieved

Acknowledgements

The author would like to thank Martin Aslett, Jurg Bahler, Rob Finn, Midori Harris, Viv Junker and Jane Lomax for preparing the course and the manual; Andy Giddings for reprographics support; and Adrian Tivey, Paul Mooney, Chris Penkett and Katja Kivinen for technical and teaching support. Additional thanks to Martin Aslett and Jurg Bahler for proofreading comments.

References

Apweiler R, Bairoch A, Wu CH, et al. 2004. UniProt: the universal protein knowledgebase. Nucleic Acids Res 32: D115–119.

Ashburner M, Ball CA, Blake JA, et al. 2000. Gene Ontology: tool for the unification of biology. Nature Genet 25: 25–29.

Aslett M, Wood V. 2006. Gene Ontology annotation status of the fission yeast genome: preliminary coverage approaches 100%. Yeast 23: (this issue).

Finn RD, Mistry J, Schuster-Bockler B, et al. 2006. Pfam: clans, web tools and services. Nucleic Acids Res 34: D247–251.

Hertz-Fowler C, Peacock CS, Wood V, et al. 2004. GeneDB: a resource for prokaryotic and eukaryotic organisms. Nucleic Acids Res 32: D339–343.

Khatri P, Draghici S, Ostermeier GC, Krawetz SA. 2002. Profiling gene expression using onto-express. Genomics 79: 266–270.

Li L, Stoeckert CJ Jr, Roos DS. 2003. OrthoMCL: identification of orthologue groups for eukaryotic genomes. Genome Res 13: 2178–2189.

Marguerat S, Jensen TS, de Lichtenberg U, et al. 2006. The more the merrier: comparative analysis of microarray studies on cell cycle-regulated genes in fission yeast. Yeast 23: 261–277.

O’Brien KP, Remm M, Sonnhammer EL. 2005. Inparanoid: a comprehensive database of eukaryotic orthologues. Nucleic Acids Res 33: D476–D480.

Penkett CJ, Morris JA, Wood V, Bahler J. 2006. YOGY: a web-based, integrated database to retrieve protein orthologs and associated Gene Ontology terms. Nucleic Acids Res 34: W330–334.

Penkett CJ, Birtle ZE, Bahler J. 2006. Simplified primer design for PCR-based gene targeting and microarray primer database: two web tools for fission yeast. Yeast 23: (this issue).

The Gene Ontology Consortium. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32: D258–261.

Tatusov RL, Fedorova ND, Jackson JD, et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4: 41.

Wood V. 2006. Schizosaccharomyces pombe comparative genomics; from sequence to systems. In Comparative Genomics using Fungi as Models, Sunnerhagen P, Piskur J (eds). Springer-Verlag: Heidelberg.

Wheeler DL, Barrett T, Benson DA, et al. 2005. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 33: D39–D45.

Wu CH, Apweiler R, Bairoch A, et al. 2006. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34: D187–191.