
Page 1: I.8 Development of crop ontology for sharing crop phenotypic …libcatalog.cimmyt.org/Download/cis/96031.pdf · I.8 Development of crop ontology for sharing crop phenotypic information

I.8 Development of crop ontology for sharing crop phenotypic information

Rosemary Shrestha and Guy F Davenport (CRIL–CIMMYT, Mexico), Richard Bruskiewich (CRIL–IRRI, The Philippines) and Elizabeth Arnaud (Bioversity, Italy)


Introduction

The amount of biological and genetic information has increased dramatically with the advent of high-throughput data collection in molecular biology and biotechnology. A large number of biological databases have been designed and made available to facilitate access to such data. However, inconsistencies in terminology, syntax and semantics present significant obstacles to sharing data and knowledge among disparate researchers and research groups (Hughes et al, 2008). To ask or explore biologically interesting questions, researchers need seamless access to distributed and interrelated data sources such as genetic data, trait data, and different types of experimental data. To overcome the inconsistencies between data sources and to share knowledge among researchers and research groups, several bio-ontologies are being constructed.

The word ontology is derived from the Greek words ‘ontos’, meaning ‘to be’, and ‘logos’, meaning ‘word’. An ontology is defined as a formal specification of a shared conceptualisation (Borst, 1997). In other words, an ontology is a controlled vocabulary that describes, in a formal manner, the types of entities (objects) within a particular discipline and the relationships between them. For example, a Linnean classification is a kind of systematics ontology (Mabee et al, 2007). The classes (types) are taxa at various ranks; each has a formal definition and a specific formal sub-typing relationship to the others. However, a Linnean classification has a single hierarchy because it represents the single ancestor-descendant branching tree of life. Each child type (ie, descendant) has a single parent type (ie, ancestor) whose properties it inherits.

In bioinformatics, an ontology is a formal representation of a set of concepts within a domain and the relationship between those concepts. This provides a shared vocabulary that can be used to model the domain in terms of the types of object and/or concept that exist, and their properties and relations. Wikipedia1 describes several common components of ontology as:

• Individuals (instances or objects)

• Classes (sets, collections, concepts, types of objects, or kinds of things)


• Attributes (aspects, properties, features, characteristics or parameters)

• Relations (ways in which classes and individuals can be related to one another)

• Function terms (complex structures formed from certain relations that can be used in place of an individual term in a statement)

• Restrictions (formally stated descriptions of what must be true in order for some assertion to be accepted as input)

• Rules (statements in the form of an if-then sentence that describe the logical inferences that can be drawn from an assertion in a particular form)

• Axioms (assertions, including rules, in a logical form that together comprise the overall theory that the ontology describes in its domain of application)

• Events (the changing of attributes or relations).

In contrast to a systematics classification, an ontology is more complex because a type can have multiple parents, linked by the same or different relationships. Therefore, the structure of an ontology is not a strict hierarchy; it is instead represented by a directed acyclic graph in which a type can have several parents and a different relationship to each of them.
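The difference between a single-parent hierarchy and an acyclic graph can be made concrete with a small sketch. The Python class below is purely illustrative and is not part of any GCP software: the class name, method names and example term labels are assumptions chosen for the example. It stores typed parent links, allows several parents per term, and rejects any link that would introduce a cycle.

```python
from collections import defaultdict


class Ontology:
    """Tiny directed acyclic graph of terms with typed parent links."""

    def __init__(self):
        # child term -> list of (relationship, parent term) pairs
        self.parents = defaultdict(list)

    def ancestors(self, term):
        """Return every term reachable by following parent links upwards."""
        seen = set()
        stack = [term]
        while stack:
            for _, parent in self.parents.get(stack.pop(), []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add(self, child, relationship, parent):
        """Link child to parent unless the link would create a cycle."""
        if child == parent or child in self.ancestors(parent):
            raise ValueError(f"{child!r} -> {parent!r} would create a cycle")
        self.parents[child].append((relationship, parent))


# Hypothetical example terms: one term with two parents.
onto = Ontology()
onto.add("vigour trait", "is_a", "trait")
onto.add("shoot anatomy and morphology trait", "is_a", "trait")
onto.add("plant height", "is_a", "vigour trait")
onto.add("plant height", "is_a", "shoot anatomy and morphology trait")
onto.add("stem length", "part_of", "plant height")

print(sorted(onto.ancestors("stem length")))
```

In a Linnean classification each term would have exactly one entry in its parent list; here ‘plant height’ has two, which is what makes the structure a graph rather than a tree.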

Various types of bio-ontology are available in a number of biological domains, each containing hundreds to thousands of standardised terms. Gene Ontology (GO)2 is the most prevalent bio-ontology; it describes genes and gene products in several model organisms using a controlled vocabulary. Plant Ontology (PO),3 Trait Ontology (TO),4 Phenotype and Trait Ontology (PATO)5 and Sol Genomics Network Ontology (SGN ontology)6 have been developed mainly for accessing plant science information. The PO mainly describes structure, anatomy and growth stages of plants. Gramene, which is a database of grass genomes providing comparative genomics tools for the grasses, is more involved in developing TO in collaboration with the Plant Ontology Consortium (Jaiswal et al, 2002). To capture qualitative and quantitative information about phenotypes, the PATO has been constructed. This is an ontology of phenotypic qualities that can be used to capture the differences between wild and mutant phenotypes of all organisms.7 SGN Ontology has recently been developed for Solanaceae phenotype information (Menda et al, 2008).

1 http://en.wikipedia.org/wiki/Ontology_(computer_science)
2 http://www.geneontology.org/
3 http://www.plantontology.org/index.html
4 http://www.gramene.org/
5 http://bioontology.org/wiki/index.php/PATO:Main_Page
6 http://sgn.cornell.edu/tools/onto/

Objective of the crop ontology

The Generation Challenge Programme (GCP) is an agricultural research consortium hosted by the International Agricultural Research Centres of the Consultative Group on International Agricultural Research (CGIAR); it involves about 22 core research institutes in partnership with many additional external collaborators. The GCP research agenda focuses on comparative genomics-driven crop improvement and high-throughput molecular characterisation of genetic resources to introduce favourable alleles into plant breeding programmes. Within GCP and the CGIAR, the volume of agriculture-related information is increasing enormously. Over the past six or seven decades, the CGIAR has accumulated historic crop data related to phenotype, breeding, germplasm, pedigree, traits etc. However, the available data are not systematised, presenting researchers with problems in data management, retrieval and accessibility. Therefore, the GCP is deploying a crop ontology (CO) that characterises the computational architecture of a knowledge-based system. This will provide researchers and end users with a user-friendly tool to facilitate the use of comparative biology infrastructure for crops such as rice, maize and wheat, and will also facilitate the use of molecular biology to accelerate and focus the crop improvement efforts that are emerging in several institutions around the world.

Scope of the crop ontology

Because the CO intends to identify ontology terms globally and uniformly, all GCP domain ontology terms managed within GCP-compliant systems, whether specifically created by GCP or adopted from the pre-existing work of external groups, are assigned a globally unique CO term identifier. This identifier consists of the prefix CO_, a three-digit number denoting the index of the ontology from which the term was adopted, a decimal point and, finally, an alphanumeric suffix that identifies the specific term or concept. The scope of the CO indexes is provided in Table 1, and the fully indexed inventory of CO is published at http://pantheon.generationcp.org/index.php.

In the case of terms derived from public ontologies, such as GO or PO, the alphanumeric suffix is simply set to the community-assigned external numeric identifier of the term within that external public ontology. In the case of external concept catalogues lacking such community-assigned identifiers but being converted to a managed ontology within the CO, and for GCP crop developed ontology terms, the CO team assigns sequential numeric identifiers.
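The identifier scheme lends itself to a simple validity check. The following Python sketch is an illustration only, not GCP code: the function and type names are hypothetical, the index ranges are copied from Table 1, and the pattern accepts a colon as well as a decimal point as separator on the assumption that both occur (Figure 1 shows an identifier of each form).

```python
import re
from typing import NamedTuple

# Index ranges as listed in Table 1 (inclusive bounds).
CO_DOMAINS = [
    (0, 0, "GCP Domain Model Ontology"),
    (10, 89, "General Germplasm and Passport Ontology"),
    (90, 99, "Taxonomic Ontology"),
    (100, 299, "Plant Anatomy and Development Ontology"),
    (300, 499, "Phenotype and Trait Ontology"),
    (500, 699, "Structural and Functional Genomic Ontology"),
    (700, 799, "Location and Environmental Ontology"),
    (800, 899, "General Science Ontology"),
    (900, 999, "Other (Sub-domain or Site-Specific) Ontology"),
]

# Prefix "CO_", three digits, a separator, then an alphanumeric suffix.
# Accepting ":" in addition to "." is an assumption of this sketch.
CO_ID = re.compile(r"^CO_(\d{3})[.:]([A-Za-z0-9]+)$")


class CoTerm(NamedTuple):
    index: int
    suffix: str
    domain: str


def parse_co_id(identifier: str) -> CoTerm:
    """Split a CO identifier and look up its domain by index range."""
    match = CO_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a CO identifier: {identifier!r}")
    index = int(match.group(1))
    for low, high, domain in CO_DOMAINS:
        if low <= index <= high:
            return CoTerm(index, match.group(2), domain)
    raise ValueError(f"index {index:03d} is outside the ranges in Table 1")


print(parse_co_id("CO_322.0000007"))
```

Indexes that fall between the published ranges (001 to 009) are rejected rather than guessed at, since Table 1 does not assign them to a domain.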

To date, nine ontology domains have been created for the CO (Table 1). The General Germplasm and Passport Ontology Domain is one of the most important ontology domains of the GCP crop ontology that is adapted from common concepts relating to genetic resources, in particular, crop descriptors.8 It includes codes for passport data, management data and attributes pertaining to a germplasm accession – the technical term used in the crop science domain to describe a single seed sample or plant clone. The vocabularies adopted come from the Multi-Crop Passport Descriptors developed jointly by FAO (Food and Agriculture Organization of the United Nations) and IPGRI (International Plant Genetic Resources Institute; Bioversity International, 2009). In addition, controlled vocabularies from the International Crop Information System (ICIS)9 have been incorporated.

The Taxonomic Ontology Domain is a collection of plant taxonomy ontologies to be adopted from external taxonomy databases such as GRIN (Germplasm Resources Information Network), NCBI (National Center for Biotechnology Information), UniProt (Protein Knowledgebase) and PO taxonomic vocabularies. The Plant Ontology Consortium (POC; Jaiswal et al, 2002) has provided the PO database for plant structure, anatomy, morphology and developmental stages (Avraham et al, 2008; Ilic et al, 2007). The Plant Anatomy and Development Ontology Domain contains the adopted POC ontologies along with the crop-specific ontologies developed for GCP mandate crops such as chickpeas (Cicer arietinum), maize (Zea mays), bananas and plantains (Musa spp), pigeon peas (Cajanus cajan), potatoes (Solanum tuberosum), rice (Oryza sativa), sorghum (Sorghum spp) and wheat (Triticum spp).

Another focused domain group is the Phenotype and Trait Ontology Domain, which includes the GCP crop-specific trait ontology along with the TO. The goal of developing a crop-specific anatomy, development and trait ontology is to provide exact meanings of terms that are related to phenotypes as described by crop physiologists, plant breeders and other crop scientists. So far, crop-specific trait ontologies have been developed for maize, rice and wheat. The development of ontologies for sorghum, chickpeas and Musa is ongoing. The rice mutant ontology is also integrated into the Phenotype and Trait Ontology Domain of CO. The Open Biomedical Ontologies (OBO)-formatted ontology files for some of these crops are currently publicly available online10 and described on the Pantheon web site11 (Figure 1). The GCP team is collaborating closely with external communities such as the POC, to harmonise terms globally.

The Structural and Functional Genomics Ontology Domains consolidate many of the cellular and molecular level process ontologies including GO (Ashburner and Lewis, 2002) and the Ontology for Biomedical Investigations.12 The Location and Environment Ontology includes location metadata like country lists, geographic information system metadata, and environmental descriptors relating to the context within which crops grow. Included in this category are public efforts such as the Environment Ontology project.13 The General Science Ontology contains physical and chemical property data for chemical species, like the Chemical Entities of Biological Interest (CHEBI)14 controlled vocabularies.

Table 1. Scope and index ranges for the CO domains.

Prefix     Crop Ontology Domain
000        GCP Domain Model Ontology
010-089    General Germplasm and Passport Ontology
090-099    Taxonomic Ontology
100-299    Plant Anatomy and Development Ontology
300-499    Phenotype and Trait Ontology
500-699    Structural and Functional Genomic Ontology
700-799    Location and Environmental Ontology
800-899    General Science Ontology
900-999    Other (Sub-domain or Site-Specific) Ontology

Figure 1. An example of the properties associated with the trait name ‘plant height’. The properties give detailed information about a trait name. (* represents the CO prefix which is given to the maize trait of phenotype and trait ontology domain.)

Term: plant height
ID: CO_322*.0000007
Namespace: maize_trait
Definition: Measurement of plant height from soil surface to the highest point in plant
Synonyms: PHT, PTHT, Planth. Shoot height
Dbxrefs: TO:0000207, IMIS_TRAITID:2050
is_a: CO_322:0000108

Methods for developing crop ontology

As an initial step, a comprehensive domain-focused search was undertaken to identify existing public domain ontology or equivalent catalogues of concepts representing key GCP data entities and attributes and covering the full range of information for crop science. Specific examples of such public domain ontology or equivalent concept catalogues include GO (Ashburner and Lewis, 2002), POC (Jaiswal et al, 2002), MIAME-Plant (Zimmermann et al, 2006) and the FAO/IPGRI Multi-Crop Passport Descriptors.

At each level of the crop ontology, specific trait type information (eg, synonyms, trait description) was included for each term. The CO allows three types of relationship between terms: ‘is_a’ (eg, the plant height is a vigour trait or shoot anatomy and morphology-related trait), ‘part_of’ (eg, the stem length is part of the plant height) as in GO, and ‘derived_from’ (eg, abscisic acid content to sugar content ratio is derived from parent terms abscisic acid content and sugar content). Information was obtained from several resources including the International Maize Information System (IMIS),15 the International Rice Information System (IRIS),16 the International Wheat Information System (IWIS),17 the Musa Germplasm Information System (MGIS),18 the ICRISAT (International Crops Research Institute for the Semi-Arid Tropics) information system19 for sorghum and chickpeas, the CIP (Centro Internacional de la Papa; International Potato Center) information system20 for potatoes, crop descriptors for traits, GCP datasets available in the GCP Central Registry,21 reference books and published articles. Descriptions for most of the traits were collected from breeders, physiologists, agronomists and researchers involved in specific projects funded by GCP.

Ontology editors are needed to facilitate the development of ontologies. They allow curators to browse, search, visualise and edit ontologies. OBO-Edit is the most commonly used bio-ontology editor (Day-Richter et al, 2007), and the OBO-Edit tool22 is used for the development of the crop ontology (Figure 2).

7 http://www.bioontology.org/wiki/index.php/PATO:Main_Page
8 http://www.bioversityinternational.org/scientific_information/themes/germplasm_documentation/crop_descriptors/
9 http://www.icis.cgiar.org/icis/index.php/Main_Page
10 http://cropforge.org/projects/gcpontology/
11 http://pantheon.generationcp.org/index.php
12 http://www.obofoundry.org/
13 http://www.envoc.org
14 http://www.ebi.ac.uk/chebi/
15 http://seeds.irri.org/imis
16 http://iris.irri.org/
17 http://www.cimmyt.org/research/wheat/iwisfol/IWISFOL.htm
18 http://www.crop-diversity.org/banana/
19 http://www.icrisat.org/
20 http://www.cipotato.org/
21 http://gcpcr.grinfo.net/

The GCP development team is developing additional software tools, primarily in the Java and Perl programming languages, to facilitate ontology management. These include tools to parse and convert ontology term metadata to the Web Ontology Language (OWL)23 or Open Biomedical Ontology (OBO) format,24 and tools for storing the ontology catalogues in the Chado controlled vocabulary schema of the Genetic Model Organism Database project.25 These Java and Perl tools can be found at http://cropforge.org/projects/gcpmodels and include GCPModelToOwl, GPCOBOParser, OBOWriter, OboToChadoLoader and OwlToChadoWriter, plus OntologyViewerRCP, a standalone and online browser currently in development. Perl scripts are also available to convert a data source in a given format into an OBO-formatted file.
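To show the kind of output such a conversion produces, the term properties listed in Figure 1 can be written as a stanza in the OBO flat-file style (one ‘tag: value’ pair per line under a [Term] header). The Python sketch below is not one of the GCP Java or Perl tools listed above; the function names are hypothetical, the identifier is written with a colon as in the is_a line of Figure 1, the synonym list is read as four separate entries, and the RELATED synonym scope is an assumption because the source does not state one.

```python
def to_obo_stanza(term):
    """Render one term as an OBO-style [Term] stanza (tag: value lines)."""
    lines = ["[Term]", "id: " + term["id"], "name: " + term["name"]]
    if term.get("namespace"):
        lines.append("namespace: " + term["namespace"])
    if term.get("definition"):
        # Assumes the definition text itself contains no double quotes.
        lines.append('def: "' + term["definition"] + '" []')
    for synonym in term.get("synonyms", []):
        # RELATED is an assumed scope; the source does not give one.
        lines.append('synonym: "' + synonym + '" RELATED []')
    for xref in term.get("xrefs", []):
        lines.append("xref: " + xref)
    for parent in term.get("is_a", []):
        lines.append("is_a: " + parent)
    return "\n".join(lines) + "\n"


def parse_obo_stanza(text):
    """Collect tag -> list of raw values from a single stanza."""
    tags = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        tags.setdefault(tag, []).append(value)
    return tags


# Values taken from Figure 1.
plant_height = {
    "id": "CO_322:0000007",
    "name": "plant height",
    "namespace": "maize_trait",
    "definition": "Measurement of plant height from soil surface "
                  "to the highest point in plant",
    "synonyms": ["PHT", "PTHT", "Planth", "Shoot height"],
    "xrefs": ["TO:0000207", "IMIS_TRAITID:2050"],
    "is_a": ["CO_322:0000108"],
}

stanza = to_obo_stanza(plant_height)
print(stanza)
```

Parsing the stanza back with parse_obo_stanza(stanza) gives a dictionary keyed by tag, which is a convenient intermediate form before loading terms into a database schema.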

Importance of deploying the GCP crop ontology

A question may arise as to why we need the CO when so many bio-ontologies are publicly available. The reason is that the CGIAR Centres have a huge amount of historic data, including those related to germplasm, breeding, disease, phenotypes and traits, and the GCP Central Registry26 contains data related to both genotype and phenotype, that are not well covered by current ontologies. Phenotype information has traditionally been captured in a free-text manner. Free text searching also forms the basis of information mining and retrieval, but it is extremely limited because of an inherent lack of accuracy and specificity (Gkoutos et al, 2004). For example, complex free text descriptions used for phenotypes are almost impossible to index and retrieve in a useful way. Advanced search tools are required to fully exploit and realise the potential of these data. To develop the current CO, terms were primarily extracted from the ICIS and data available in the GCP Central Registry. Many traits from these resources were difficult to understand just by reading. The ‘abscisic acid content/sugar content in ear leaf 2 weeks after the last irrigation’ is one example of how researchers give names to traits. This name is given for a QTL-based drought-tolerant maize trait. The exact meaning of the term is the ratio of abscisic acid to sugar content in the ear leaf as measured two weeks after the last irrigation (personal communication, J-M Ribaut, GCP). In this case, two traits or attributes, ‘abscisic acid content’ and ‘sugar content’, are clearly linked together. Moreover, the entity ‘ear leaf’, the time qualifier ‘2 weeks’, and the treatment method are written along with the traits. This observation indicates that such complex terms need to be broken down into a simpler form and annotated separately. In addition, a suggestion can be made for a term to cover a new trait: ‘ratio of abscisic acid content to sugar content’.

Figure 2. A screen capture of the crop ontology as seen within the OBO-Edit tool, displaying all the information associated with the trait ‘Java downy mildew’.

22 http://www.oboedit.org/
23 http://www.w3.org/TR/owl-features/
24 http://www.obofoundry.org/
25 http://www.gmod.org
26 http://gcpcr.grinfo.net/

The importance of developing the CO can be seen with the following examples, which are illustrated in detail in Figure 3. In the first example (Figure 3a), the attribute (trait) ‘osmotic pressure’ is written together with the treatment method and time qualifier. It is clear that ‘osmotic pressure’ was observed at different time intervals for each treatment. However, researchers give several trait names for one trait. This is generally called “the semantic problem”: it concerns the meanings of words and, in this case, multiple trait names for the same trait (only treatment and time differ). Software treats the terms as independent rather than interrelated unless the relationships between them are explicitly defined. The CO will help alleviate this problem through standardisation of the terms used to refer to crop traits and explicit definition of relationships between the same or related traits. There will still be difficulty with historical data, but incorporating synonyms into the CO will aid in the inclusion of such data.
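The osmotic pressure case can be sketched in a few lines of Python. The example is hypothetical and is not taken from ICIS or any GCP tool: the function name, the record type and the two qualifier lists are assumptions read off the variable names in Figure 3a. It maps each free-text variable name to one standard trait plus separate time and treatment qualifiers, so that software can recognise the names as observations of the same trait.

```python
from dataclasses import dataclass
from typing import Optional

# Qualifier vocabularies read off the names in Figure 3a (assumed complete).
TIMES = ("morning", "afternoon")
TREATMENTS = ("well-watered", "intermediate stress", "stressed")


@dataclass(frozen=True)
class Observation:
    trait: str
    time_of_day: Optional[str] = None
    treatment: Optional[str] = None


def split_variable_name(name: str, trait: str = "osmotic pressure") -> Observation:
    """Separate a compound variable name into trait, time and treatment."""
    text = name.strip().lower()
    if not text.startswith(trait):
        raise ValueError(f"{name!r} does not start with trait {trait!r}")
    rest = text[len(trait):].strip()
    # Peel off the time qualifier first, if one is present.
    time_of_day = next((t for t in TIMES if rest.startswith(t)), None)
    if time_of_day is not None:
        rest = rest[len(time_of_day):].strip()
    # Whatever remains must be a known treatment or nothing at all.
    treatment = rest or None
    if treatment is not None and treatment not in TREATMENTS:
        raise ValueError(f"unrecognised qualifier {rest!r} in {name!r}")
    return Observation(trait, time_of_day, treatment)


print(split_variable_name("Osmotic pressure afternoon intermediate stress"))
```

Once the qualifiers are held in separate fields, all eight names in Figure 3a resolve to the single trait ‘osmotic pressure’, and the original names can be kept as synonyms for matching historical data.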

Variables for the trait ‘grain yield’ can be observed in the IMIS database (Figure 3b). The trait has been measured in different units (Grain Yield kg, Grain Yield tons) at adjusted moisture condition for shelled (Grain Yieldkg_FieldWt, Grain YieldTons_FieldWt) and unshelled grain (Grain YieldKg_GrainWt, Grain YieldTons_GrainWt) without adjusting moisture conditions. Another interesting example is that of the wheat biotic stress-related trait ‘Helminthosporium sativum (blotch disease) resistance’. In IWIS, researchers write this trait in several ways such as “%HSATIVUM_FLAG-1” and “%HSATIVUM_TILLER1_FLAG-1” (Figure 3c). The first problem with this trait is the difficulty in understanding the meaning of the name itself. Here, the term actually refers to a trait scored for response to blotch symptoms caused by Helminthosporium sativum, with the disease observed in different plant parts (‘entities’). It can be clearly seen that ‘leaf’, ‘node’, ‘flag leaf’, and ‘tiller’ are four different entities used for scoring the symptoms. These terms can be mapped to PO terms. Also, the trait was observed at different growth stages such as at tiller 1, tiller 2, tiller 3, flag 1, flag 2, flag 3 etc. Moreover, two types of unit (a percentage and a scale ranging from 0–5) are used to measure disease resistance. Clearly, it is very important for researchers to observe disease occurrence at different plant growth stages and in various plant parts for the same trait. However, the practice of including methods, units, and growth stages within the trait name causes problems in data quality and in comparing these terms.

Figure 3. An example of complex trait names and variations in measurement encountered in the databases during the crop ontology development process.

a. Osmotic pressure
Osmotic pressure morning
Osmotic pressure morning well-watered
Osmotic pressure morning intermediate stress
Osmotic pressure morning stressed
Osmotic pressure afternoon
Osmotic pressure afternoon well-watered
Osmotic pressure afternoon intermediate stress
Osmotic pressure afternoon stressed

b. Grain yield
Grain YieldKg (shelled)
Grain YieldKg_FieldWt (unshelled)
Grain YieldKg_GrainWt (shelled)
Grain YieldTons
Grain YieldTons_FieldWt
Grain YieldTons_GrainWt

c. Helminthosporium sativum blotch
Hsativum on leaf (0-5)
% Hsativum on node
% Hsativum_Flag
% Hsativum_Flag1
% Hsativum_Flag2
% Hsativum_Tiller1_Flag
% Hsativum_Tiller1_Flag1
% Hsativum_Tiller2_Flag
% Hsativum_Tiller1_Flag2
% Hsativum_Tiller2_Flag1
% Hsativum_Tiller2_Flag2


The phenotype is the observable physical or biochemical characteristics of an organism that are determined by both genetic and environmental influences in many different dimensions such as biochemical, cellular, anatomical, behavioural etc. Whatever the dimension and granularity, most phenotypic descriptions or characters can be broken down into two parts: (i) an entity that is affected; this entity may be an enzyme, an anatomical structure or complex biological process; (ii) the qualities of that entity such as colour, size, mass, length etc that can be measured. Every entity has certain qualities that exist for as long as the entity exists. Therefore, qualities are inherent to entities.27

PATO represents qualitative phenotypic information and helps in measurements involving units. One very useful PATO concept is illustrated in Figure 4. The complex trait ‘young leaf chlorophyll content 2 wks after irrigation’ cannot be a standard trait name; it needs to be broken down into terms from several ontology domains. ‘Leaf’ is the observable character that can be mapped to a PO term. ‘Young leaf’ indicates both the quality of the leaf and also its growth stage. However, the growth stage is not mentioned clearly. Since the chlorophyll is measured in chlorophyll content units (CCU), the role of a unit ontology can also be noted. Moreover, the temporal factor ‘2 wks post or after irrigation’ can be mapped to a time factor ontology term. Importantly, this trait is also related to the external environmental factor ‘water stress’. These observations indicate the importance of breaking a term into several simple terms and of allocating the broken-down terms to the appropriate ontology domains. In this way, PATO concepts may be applied to tackle such complex traits and to annotate data correctly.
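One way to picture the result of this decomposition is as a structured record in which each part of the original name has its own field. The Python sketch below is only an illustration: the class and field names are hypothetical, the qualifier keys are not CO terms, and the value of 100 CCU is the example measurement shown in Figure 4.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TraitAnnotation:
    entity: str                   # what is observed (a Plant Ontology term)
    quality: str                  # what is measured (a PATO-style quality)
    value: Optional[float] = None
    unit: Optional[str] = None    # would map to a unit ontology term
    qualifiers: Dict[str, str] = field(default_factory=dict)

    def label(self) -> str:
        """Trait name without units, timing or treatment baked in."""
        return f"{self.entity} {self.quality}"


# 'young leaf chlorophyll content 2 wks after irrigation', decomposed.
annotation = TraitAnnotation(
    entity="leaf",
    quality="chlorophyll content",
    value=100.0,
    unit="CCU",
    qualifiers={
        "entity quality": "young",            # growth stage itself is unstated
        "time": "2 weeks after irrigation",   # temporal factor
        "environment": "water stress",        # external environmental factor
    },
)

print(annotation.label())
```

The trait name that remains is simply ‘leaf chlorophyll content’; the unit, timing and environmental context travel with the observation instead of being embedded in the name.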

Recently, research has increasingly been focused at the molecular level, with QTLs, molecular breeding, simple sequence repeat (SSR) tools, etc being used extensively for crop phenotyping. Different alleles of one gene can have effects on many important traits. For example, the dwarf1 gene is associated with several phenotypic traits in rice such as plant height, stem length, leaf colour, panicle length etc.28 Capturing such information in a controlled vocabulary will allow researchers to compare data that are stored within and between databases. With the inception of the CO project, the rice mutant ontology is integrated as a CO resource, and each mutant phenotype controlled vocabulary is now an ontology term in the GCP rice mutant ontology. To facilitate smooth data exchange across databases and data annotation, a controlled vocabulary will be used in GCP data templates29 to help control data quality.

27 http://www.bioontology.org/wiki/index.php/PATO:Main_Page
28 http://www.gramene.org/db/genes/search_gene?acc=GR:0060184
29 http://generationcp.org/bioinformatics.php?da=0526023

Figure 4. Schematic diagram conceptually representing a method for dissecting a complex phenotyping trait term and several factors that affect the phenotyping of crops. Red text represents the ontology terms broken down from the complex term ‘leaf chlorophyll content 2 wks after irrigation’ that are embedded into respective ontology domains.

[Diagram labels: Phenotype, shown as the effects of a genotype factor (G: markers/alleles, Sequence Ontology), an external environmental factor (E: treatment, location, climatic variable, growth conditions, water stress, agronomy management), a temporal factor (T: Time Ontology, ‘2 wks post irrigation’) and an experiment factor (ED: Experimental Design Ontology). Plant Ontology: plant structure (leaf), plant anatomy, development stage of the leaf (marked ‘?’). Qualities and Units Ontology: PATO qualifier (young leaf ‘quality’), phenotypic quality (chlorophyll content), unit for chlorophyll measurement, value of measurement (100 CCU). Assessment methods or evidence ontology (eg ICIS).]


Data sources and data availability

Webpage and downloads

To facilitate development of the CO, a site for curators and collaborators has been developed on the Pantheon project web site30 (Figure 5). The site contains the complete indexed inventory of the CO and a ‘best practices’ methodology for CO curation. Complementing the Pantheon website is a project created for the CO on the CropForge software project management site.31 This site provides both the latest releases and previous versions of the ontology flat files (which describe the terms, relationships and definitions) and of the software tools. The site also provides forums and mailing lists for communication among collaborators. All the ontology flat files can be downloaded for local use.

The Pantheon site is mainly targeting software and ontology developers. To cater for scientific end users of the CO, an additional site is being established.32 The latter site uses the CO inventory page to index summary pages for each ontology, with links to the available ontology files and to other sites related to external ontologies. All of the information described on the website is freely available and all ontology flat files in OBO format are also available to download.

GCP platform integration of the crop ontology

Within the GCP, a crop informatics platform is being developed to provide queries, visualisation and analyses for biodiversity analysis, comparative analysis of crop genomic data, and plant breeding (Bruskiewich et al, 2008). In the near future, the ontology project team will commission a browser interface for searching for ontology terms or a specific ontology hierarchy. A web page tool for annotating GCP datasets with the CO is under construction. The interface will enable researchers to query a comprehensive CO database using keywords related to traits, plant structure, growth stages and molecular functions. Other queries will direct users to associated GCP phenotype, genotype and other related crop datasets.

Figure 5. Overview of the GCP semantics. Part I shows submenus under the main menu GCP semantics available at http://pantheon.generationcp.org/index.php. Part II indicates an overview of the application part of the CO. The light shaded rectangular boxes represent major GCP target ontology domains.

[Diagram labels. Part I: GCP Semantics; GCP Models; Domain Model; XML Schemas; BioMoby Data Types; GCP Ontology; Project; Inventory; Using Ontology; Related Links. Part II: Ontology source (ICIS-based databases, GCP data templates); Developing the Ontology; Upload Ontology into Chado (GCP Ontology Database); Crop Ontology Browser; Koios (Zeus) Web Interface; Annotated data with the Crop Ontology; GCP datasets; ICIS datasets; Web Query and Data Analysis using GCP Ontology; Linkage to External Ontology (PO, TO, SP). List of Ontology Domains: Germplasm Ontology; Taxonomic Ontology; Plant Anatomy and Development Ontology; Phenotype and Trait Ontology; Location and Environment Ontology; General Science Ontology; Structural and Functional Genomic Ontology.]

30 http://pantheon.generationcp.org under GCP Semantics ... GCP Ontology menu item
31 http://cropforge.org/projects/gcpontology/
32 http://mcclintock.generationcp.org/


Future development plans include the further elaboration of GCP-compliant ontology data sources connected both to GCP-curated ontology repositories, and remote public repositories of ontology terms. The development of such data sources will fully integrate ontology management into the GCP platform for the integration of data distributed across the Internet.

GCP platform integration of the CO will eventually include software support for nomination of new CO terms during data entry into GCP-compliant data templates and databases. This mechanism will be particularly relevant to GCP-developed ontology for crop-specific phenotype and breeding experiments.

Semantic web implementation of GCP semantics

Mapping of the GCP domain model and ontology into various Internet semantic protocols is in progress. The most mature of such mappings33 involve GCP data types designed for BioMoby (Wilkinson et al, 2008). Briefly, BioMoby data types inheriting from the Moby:Object are being manually designed to mirror the content and topology of the GCP domain model relatively faithfully, with the general modification that most complex attributes and association edges relating to model interfaces in the domain model are modelled as simple identifier (GCP_SimpleIdentifier) article names that can be used as inputs into additional web services to retrieve connected model data. Additional semantic network mappings have been partially prototyped, although not completed, in the TAPIR XML schema,34 OWL35 for use by SSWAP36 and onto the Genomic Diversity and Phenotype Connection (GDPC; Casstevens and Buckler, 2008).

GCP platform and network applications

Some GCP platform applications using the CO are already in development. A GCP platform-compliant ontology database adapter is being developed to retrieve ontology data from the Chado database.

Terms from the MIAME-Plant ontology have been incorporated into the GCP version of maxdLoad2.37 MaxdLoad is a component of the MAXD system, an open source relational database schema and Java application for the annotation and storage of microarray data.38 MAXD complies with the standards for microarray metadata capture using the MGED ontology. However, to suit the purposes of the GCP, it is being customised to reflect the standards set by the MIAME-Plant microarray ontology (Zimmermann et al, 2006).

A GCP data source adapter has been written in the Java programming language to communicate with BioMoby web services, using GCP data types constructed within the Moby Dashboard and scaffold service code generated by the MoSES framework (Wilkinson et al, 2008). A Java prototype of a GCP data source adapter that communicates with the GDPC protocol has also been completed.

Outreach activities

To achieve these goals, the CO team has recently participated in a variety of conferences through poster presentations, oral presentations and workshops. In addition, face-to-face meetings with crop breeders, physiologists, pathologists and other scientists have been initiated to raise awareness of ontology and to validate the collected terms and their definitions. Efforts to engage the GCP and other related research communities are ongoing and will remain a high priority for the project, to improve the ontology and its utility for plant researchers.

Conclusions

Crop science and genetic resources practitioners span many diverse research communities, each with its favourite Internet communication protocols (eg, BioMoby, TAPIR, SSWAP and GDPC). The value of the CO, along with the GCP domain model and the associated Pantheon software engineering framework, is that together they provide a common target semantics and software framework for global integration of data across these diverse community protocols. The GCP semantics are designed to be extensible, to allow new semantics and novel data types to be added to the framework as the needs of crop research and development evolve.

Community involvement in the development of crop-specific ontology for anatomy and plant traits is being initiated with collaborators from several other institutions, including ICRISAT, CIP and the POC. In developing the crop ontology, priority will be given to ontology terms that describe crop drought tolerance experiments, a primary scientific focus of the GCP. Generally, POC best practices will be followed for specifying ontology term nomenclature, definition and semantics, and phenotype terms will normally be defined as cross-products of PO, PATO and other related public ontologies. The use of ontology terms in describing agronomic phenotypes, and the ability to map these descriptions accurately into other resource databases, will be an important step in gene discovery experiments.

33 Documented at http://pantheon.generationcp.org/
34 http://www.tdwg.org/activities/tapir/
35 http://www.w3.org/TR/owl-features/
36 http://sswap.info/
37 http://cropforge.org/projects/gcpmicroarray/
38 http://bioinf.man.ac.uk/microarray/maxd/


References

Ashburner M and Lewis SE (2002). On ontologies for biologists: the Gene Ontology - uncoupling the web. Novartis Foundation Symposium 247:66–80 (discussion 80–83, 84–90, 244–252).

Avraham S, Tung C-W, Ilic K, Jaiswal P, Kellogg EA, McCouch S, Pujar A, Reiser L, Rhee SY, Sachs MM, Schaeffer M, Stein L, Stevens P, Vincent L, Zapata F and Ware D (2008). The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations. Nucleic Acids Research 36(Database issue):D449–D454.

Bioversity International (2009). FAO/IPGRI Multi-crop Passport Descriptors. Bioversity International, Rome, Italy, 4 pp. http://www.bioversityinternational.org/Publications/pubfile.asp?ID_PUB=124.

Borst WN (1997). Construction of engineering ontologies for knowledge sharing and reuse. PhD Dissertation, Dutch Graduate School for Information and Knowledge Systems, Enschede, the Netherlands.

Bruskiewich R, Senger M, Davenport G, Ruiz M, Rouard M, Hazekamp T, Takeya M, Doi K, Satoh K, Costa M, Simon R, Balaji J, Akintunde A, Mauleon R, Wanchana S, Shah T, Anacleto M, Portugal A, Ulat VJ, Thongjuea S, Braak K, Ritter S, Dereeper A, Škofič M, Rojas E, Martins N, Pappas G, Alamban R, Almodiel R, Barboza LH, Detras J, Manansala K, Mendoza MJ, Morales J, Peralta B, Valerio R, Zhang Y, Gregorio S, Hermocilla J, Echavez M, Yap JM, Farmer A, Schiltz G, Lee J, Casstevens T, Jaiswal P, Meintjes A, Wilkinson M, Good B, Wagner J, Morris J, Marshall D, Collins A, Kikuchi S, Metz T, McLaren G and van Hintum T (2008). The Generation Challenge Programme Platform: Semantic Standards and Workbench for Crop Science. International Journal of Plant Genomics 2008, Article ID 369601, 6 pp.

Casstevens T and Buckler E (2008). The Genomic Diversity and Phenotype Connection. http://www.maizegenetics.net/gdpc/.

Day-Richter J, Harris MA, Haendel M, The Gene Ontology OBO Edit Working Group and Lewis S (2007). Obo-Edit – An ontology editor for biologists. Bioinformatics 23:2198–2200.

Gkoutos GV, Green ECJ, Mallon A-M, Blake A, Greenaway S, Hancock JM and Davidson D (2004). Ontologies for the description of mouse phenotypes. Comparative and Functional Genomics 5:545–551.

Hughes LM, Bao J, Hu Z-L, Honavar V and Reecy JM (2008). Animal trait ontology: The importance and usefulness of a unified trait vocabulary for animal species. Journal of Animal Science 86:1485–1491.

Ilic K, Kellogg EA, Jaiswal P, Zapata F, Stevens PF, Vincent LP, Avraham S, Reiser L, Pujar A, Sachs MM, Whitman NT, McCouch SR, Schaeffer ML, Ware DH, Stein LD and Rhee SY (2007). The plant structure ontology, a unified vocabulary of anatomy and morphology of a flowering plant. Plant Physiology 143:587–599.

Jaiswal P, Ware D, Ni J, Chang K, Zhao W, Schmidt S, Pan X, Clark K, Teytelman L, Cartinhour S, Stein L and McCouch S (2002). Gramene: development and integration of trait and gene ontologies for rice. Comparative and Functional Genomics 3:132–136.

Mabee PM, Ashburner M, Cronk Q, Gkoutos GV, Haendel M, Segerdell E, Mungall C and Westerfield M (2007). Phenotype ontologies: the bridge between genomics and evolution. Trends in Ecology and Evolution 22:345–350.

Menda N, Buels RM, Tecle I and Mueller LA (2008). A community-based annotation framework for linking Solanaceae genomes with phenomes. Plant Physiology 147:1788–1799.

Wilkinson MD, Senger M, Kawas E, Bruskiewich R, Gouzy J, Noirot C, Bardou P, Ng A, Haase D, Saiz EA, Wang D, Gibbons F, Gordon PMK, Sensen CW, Carrasco JMR, Fernández JM, Shen L, Links M, Ng M, Opushneva N, Neerincx PBT, Leunissen JAM, Ernst R, Twigger S, Usadel B, Good B, Wong Y, Stein L, Crosby W, Karlsson J, Royo R, Párraga I, Ramírez S, Gelpi JL, Trelles O, Pisano DG, Jimenez N, Kerhornou A, Rosset R, Zamacola L, Tarraga J, Huerta-Cepas J, Carazo JM, Dopazo J, Guigo R, Navarro A, Orozco M, Valencia A, Claros MG, Pérez AJ, Aldana J, Rojano MM, Cruz RF, Navas I, Schiltz G, Farmer A, Gessler D, Schoof H and Groscurth A (2008). The BioMoby Consortium: Interoperability with Moby 1.0 – It’s better than sharing your toothbrush! Briefings in Bioinformatics 9:220–231.

Zimmermann P, Schildknecht B, Craigon D, Garcia-Hernandez M, Gruissem W, May S, Mukherjee G, Parkinson H, Rhee S, Wagner U and Hennig L (2006). MIAME/Plant - adding value to plant microarrray experiments. Plant Methods 9(2):1.
