creating and querying an integrated ontology for molecular and phenotypic cereals data


Special Session on Agricultural Metadata & Semantics, 2nd International Conference on Metadata & Semantics Research, October 10-11, 2007 - Corfu, Greece. Creating and Querying an Integrated Ontology for Molecular and Phenotypic Cereals Data. Sonia Bergamaschi, Antonio Sala

<mtsr>07 - Special Session on Agricultural Metadata & Semantics

Antonio Sala - Università di Modena e Reggio Emilia

www.dbgroup.unimo.it

Creating and Querying an Integrated Ontology for Molecular and Phenotypic Cereals Data

Special Session on Agricultural Metadata & Semantics

2nd International Conference on Metadata & Semantics Research

October 10-11, 2007 - Corfu, Greece

Sonia Bergamaschi, Antonio Sala

DII - Dipartimento di Ingegneria dell’Informazione

Università di Modena e Reggio Emilia, via Vignolese 905, 41100 Modena

Funded by:

FIRB project NeP4B: Networked Peers for Business (www.dbgroup.unimo.it/nep4b) and IST FP6 STREP project STASIS (www.dbgroup.unimo.it/stasis)

Motivations

• Numerous public data sources have been created and are now available to researchers in the field of molecular biology

• Accessing this large amount of data is difficult because of:
– Numerous sources
– Heterogeneous interfaces and structures
– Low IT skills of the users

• What is needed:
– Extracting and fusing information from different data sources
– Presenting the information through a single interface, in a transparent and easy way, independently of the format of the different sources

CEREALAB project

• Conducted by the Agrarian Faculty of the University of Modena and Reggio Emilia and funded by the Regional Government of Emilia Romagna

• The aim is to perform intelligent data integration of existing databases, i.e. to create a Global Virtual Schema (GVV) for the genotypic selection of cereal cultivars

• Genotypic selection of cereal cultivars:
– To extract genotypic data correlated to phenotypic traits

• 3 species involved:
– Wheat
– Barley
– Rice

The MOMIS System

MOMIS (Mediator Environment for Multiple Information Sources) (www.dbgroup.unimo.it/Momis) is a framework to perform information extraction and integration from heterogeneous distributed data sources and query management

[Figure: MOMIS architecture. A Global Virtual View (GVV) sits on top of six wrappers, one per data source (CEREALAB, Graingenes, GRIN, Gramene, CRA, ERData); the labels "gene" and "FHB" appear at the GVV and at the sources.]

Integration Process Overview

[Figure: integration process. Stages and labels shown in the diagram:
– WRAPPING: the structured sources (GRIN, CEREALAB, Graingenes) are wrapped into ODLI3 local schemas 1, 2 and 3
– AUTOMATIC/MANUAL (SEMI-AUTOMATIC) ANNOTATION: schema elements are linked to synsets
– COMMON THESAURUS GENERATION: schema-derived, lexicon-derived, user-supplied and inferred relationships are collected in the Common Thesaurus
– Clusters generation
– GVV GENERATION (CEREALAB): global classes and mapping tables
– OWL export]

Data Sources

• Molecular data:
– Gramene, Relational DB (www.gramene.org)
– Graingenes, Relational DB (wheat.pw.usda.gov/GG2)
– CEREALAB experimental data, Relational DB

• Phenotypic data:
– GRIN, Excel files (www.ars-grin.gov)
– CEREALAB repository, Relational DB (created by collecting data from specific literature for regional germplasms and from the Italian National Council of Research in Agriculture (CRA))

• Considered separately, each of these data sources provides incomplete information for the purposes of the CEREALAB project, and the sources sometimes overlap

Local Source Schemata Annotation

Local Source Annotation
• Assigns a meaning (from WordNet) to each local class and attribute name of a local schema
• Semi-automatically performed
• A WordNet Editor is available to extend WordNet by adding new domain-dependent terms and synsets
• This extension step has to be performed only the first time a domain is handled

Automatic Annotation (present in WordNet)

Gene: a segment of DNA that is involved in producing a polypeptide chain; it can include regions preceding and following the coding DNA as well as introns between the exons; it is considered a unit of heredity; "genes were formerly called factors"

Manual Annotation (not present in WordNet)

Marker (existing WordNet sense): some conspicuous object used to distinguish or mark something; "the buoys were markers for the channel"

Marker (added domain sense): a segment of DNA with an identifiable physical location on a chromosome for any feature that has been genetically mapped
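The annotation step can be pictured as a lookup against a lexicon that the designer may extend once per domain. The following is a minimal Python sketch, not MOMIS code: the `Lexicon` class, its method names, the abbreviated glosses and the small in-memory dictionary standing in for WordNet are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Lexicon:
    """Toy stand-in for WordNet: maps a term to a list of glosses (senses)."""
    senses: dict = field(default_factory=dict)

    def add_sense(self, term, gloss):
        """Manual extension: add a domain-dependent sense for a term."""
        self.senses.setdefault(term.lower(), []).append(gloss)

    def annotate(self, name):
        """Automatic annotation: return the known senses, or [] if none."""
        return list(self.senses.get(name.lower(), []))


# Senses assumed to be already present in the base lexicon (glosses shortened).
lexicon = Lexicon({
    "gene": ["a segment of DNA that is involved in producing a polypeptide chain"],
    "marker": ["some conspicuous object used to distinguish or mark something"],
})

# The designer adds the domain sense of "marker"; needed once per domain.
lexicon.add_sense(
    "marker",
    "a segment of DNA with an identifiable physical location on a chromosome",
)

# Annotate the names found in a hypothetical local schema.
local_schema_names = ["gene", "marker", "qtl"]
annotations = {name: lexicon.annotate(name) for name in local_schema_names}

# Names with no sense at all are the ones left for manual annotation.
unannotated = [name for name, found in annotations.items() if not found]
```

In this sketch `marker` ends up with two candidate senses, so the designer would still have to pick the domain one, which is the semi-automatic part of the process.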

Common Thesaurus Generation

• MOMIS constructs a Common Thesaurus including SYN (synonym), BT/NT (broader/narrower term), and RT (related term) relationships among schema elements.

• The Common Thesaurus is constructed through an incremental process in which the following relationships are added:
– schema-derived relationships
– lexicon-derived relationships
– designer-supplied relationships
– inferred relationships

• As an example:
– gene is identified as a BT of allele (as gene is a direct hypernym of allele)
– marker is identified as a NT of gene (as genetic marker has been added as a direct hyponym of gene)
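A rough Python sketch of how lexicon-derived BT/NT relationships and inferred relationships could be collected is given here. It is an illustration under stated assumptions, not the MOMIS algorithm: the function names are invented, inference is reduced to the transitivity of BT, and the `gene -> dna_segment` link is a hypothetical entry added only so that there is something to infer.

```python
# Direct hypernym links: term -> its direct hypernym (broader term).
# "allele" and "marker" under "gene" come from the slide; the
# "gene" -> "dna_segment" link is a hypothetical entry for illustration.
DIRECT_HYPERNYM = {
    "allele": "gene",
    "marker": "gene",
    "gene": "dna_segment",
}


def lexicon_derived(hypernyms):
    """Return (broader, 'BT', narrower) triples for each direct link."""
    return {(broad, "BT", narrow) for narrow, broad in hypernyms.items()}


def infer_transitive(relations):
    """Add BT triples implied by transitivity until nothing new appears."""
    closed = set(relations)
    changed = True
    while changed:
        changed = False
        for a, _, b in list(closed):
            for c, _, d in list(closed):
                if b == c and (a, "BT", d) not in closed:
                    closed.add((a, "BT", d))
                    changed = True
    return closed


def with_inverse(bt_relations):
    """Each BT triple implies the inverse NT triple."""
    return bt_relations | {(narrow, "NT", broad) for broad, _, narrow in bt_relations}


thesaurus = with_inverse(infer_transitive(lexicon_derived(DIRECT_HYPERNYM)))
```

The resulting set contains both examples from the slide, `("gene", "BT", "allele")` and `("marker", "NT", "gene")`, plus the inferred triples involving the hypothetical `dna_segment` term.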

Global Virtual Schema Generation

MOMIS identifies classes that describe the same or semantically related concepts in different sources and groups them into clusters (global classes)

Source, Reference: a publication (or a passage from a publication) that is referred to

gene(Global) gene(CEREALAB) gene(Gramene)

allele allele allele

locus locus

name(Join) name name

reference title reference

… … …

Mappings are generated between the global class and the local classes in the cluster (according to a GAV approach)

A Mapping Table (MT) is automatically generated for each global class of a GVV

The designer may interactively refine and complete the proposed integration by adding Data Conversion Functions from local to global attributes, or Resolution Functions for global attributes to resolve conflicts among local attribute values
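One way to picture a Mapping Table together with the two kinds of designer-supplied functions is the Python sketch below. It is only an illustration: the dictionary layout, the function names, the name-normalising conversion and the "prefer one source" resolution policy are assumptions, and the sample record is invented. Only the rows that can be read unambiguously from the table above are included.

```python
# Mapping table for the global class "gene": global attribute -> local
# attribute per source. Only rows readable on the slide are included.
MAPPING_TABLE = {
    "name":      {"CEREALAB": "name",   "Gramene": "name"},
    "allele":    {"CEREALAB": "allele", "Gramene": "allele"},
    "reference": {"CEREALAB": "title",  "Gramene": "reference"},
}

# Hypothetical data conversion function: normalise gene names.
CONVERSIONS = {"name": lambda value: value.strip().upper()}


def to_global(source, local_record):
    """Rename local attributes to global ones and apply conversions."""
    result = {}
    for global_attr, per_source in MAPPING_TABLE.items():
        local_attr = per_source.get(source)
        if local_attr is None or local_attr not in local_record:
            continue
        convert = CONVERSIONS.get(global_attr, lambda value: value)
        result[global_attr] = convert(local_record[local_attr])
    return result


def prefer_source(preferred):
    """Hypothetical resolution function: trust one source on conflicts."""
    def resolve(values_by_source):
        if preferred in values_by_source:
            return values_by_source[preferred]
        # Fall back to whichever value is available.
        return next(iter(values_by_source.values()))
    return resolve


# Invented sample record, for illustration only.
cerealab_row = {"name": " abc1 ", "allele": "a", "title": "Some paper"}
global_row = to_global("CEREALAB", cerealab_row)

resolve_reference = prefer_source("Gramene")
chosen = resolve_reference({"CEREALAB": "Some paper", "Gramene": "Other paper"})
```

Because the mapping is GAV-style, each global attribute is defined directly in terms of the local attributes, so translating a local record is a simple rename plus optional conversion.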

Join Conditions

(CEREALAB.gene.name) = (gramene.gene.name)

• Object identification: merging data from different sources requires that different instantiations of the same real-world object be identified

• Join Conditions between pairs of local classes belonging to the same global class make it possible to identify instances of the same object and fuse them.
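The effect of a join condition such as the one on `name` can be sketched in Python as follows. This is a simplified illustration, not the MOMIS implementation: it assumes the records have already been renamed to global attributes, uses an outer-join style merge as one plausible fusion strategy, keeps the first non-missing value per attribute, and works on invented sample rows.

```python
def fuse_on_key(left, right, key):
    """Fuse records from two local classes that share the same key value."""
    fused = {}
    for records in (left, right):
        for record in records:
            entry = fused.setdefault(record[key], {})
            for attr, value in record.items():
                # Keep the first non-missing value seen for each attribute.
                if value is not None and entry.get(attr) is None:
                    entry[attr] = value
    return list(fused.values())


# Invented rows, already expressed with global attribute names.
cerealab = [
    {"name": "geneA", "allele": "a1", "reference": None},
    {"name": "geneB", "allele": "b1", "reference": "ref-1"},
]
gramene = [
    {"name": "geneA", "allele": None, "reference": "ref-2"},
    {"name": "geneC", "allele": "c1", "reference": None},
]

fused = fuse_on_key(cerealab, gramene, key="name")
```

Here `geneA` appears in both sources and is returned once, with its allele taken from the first source and its reference from the second; `geneB` and `geneC` are kept even though only one source knows them.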

The Global Virtual Schema (GVV)

• Each GVV element is automatically annotated w.r.t. WordNet and can be exported in OWL

• The GVV can be seen as an Ontology of the underlying sources

• This Ontology correlates the molecular data of Gramene, Graingenes and the CEREALAB project with the phenotypic data of the GRIN database and those collected by the CEREALAB project

The Integrated Ontology

• Molecular data associated with germplasms and phenotypic evaluations

The Integrated Ontology

Phenotypic data are divided into six categories chosen among those of major interest for cereal breeders:

– Abiotic Stress
– Biotic Stress
– Growth and Development related traits
– Quality traits
– Yield traits

The Integrated Ontology

• For each trait, the specific evaluation of a germplasm for that trait is available

The Integrated Ontology

• Molecular data are related to phenotypic data by indicating their presence in a germplasm for which a quantitative phenotypic evaluation is available

• Information is provided about specific molecular markers that can identify genes or QTLs that express a particular phenotypic trait.

• In this way, genotypic selection of cereal cultivars can be performed starting from phenotypic data

Querying the Integrated Ontology

The MOMIS Query Manager is the coordinated set of functions that takes an incoming query (the global query) and performs the following steps:

• Query rewriting
– the global query is rewritten as an equivalent set of queries expressed on the local sources (local queries)

• Local query execution
– the local queries are sent to and executed on the local sources

• Fusion and Reconciliation
– the local answers are fused into the global answer

• A Graphical User Interface has been developed to compose queries over the GVV regardless of the specific languages of the source databases.
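The three steps above can be sketched in Python as a small pipeline. This is an illustrative sketch under stated assumptions, not the MOMIS Query Manager: the source names `S1` and `S2`, their attribute names, the mapping, the sample rows and all function names are invented, queries are limited to equality conditions, and fusion simply merges rows with the same `name`.

```python
MAPPING = {  # global attribute -> local attribute, per hypothetical source
    "name":       {"S1": "qtl_name", "S2": "name"},
    "trait":      {"S1": "trait",    "S2": "trait_name"},
    "chromosome": {"S1": "chrom"},
}

SOURCES = {  # invented rows standing in for the local databases
    "S1": [{"qtl_name": "qtl-1", "trait": "FHB", "chrom": "3B"},
           {"qtl_name": "qtl-2", "trait": "other", "chrom": "1A"}],
    "S2": [{"name": "qtl-1", "trait_name": "FHB"},
           {"name": "qtl-3", "trait_name": "FHB"}],
}


def rewrite(select, where):
    """Step 1: express the global query on each source's own attributes."""
    local_queries = {}
    for source in SOURCES:
        # A source can answer only if it maps every attribute in the condition.
        if any(source not in MAPPING[attr] for attr in where):
            continue
        local_queries[source] = {
            "select": {MAPPING[a][source]: a for a in select if source in MAPPING[a]},
            "where": {MAPPING[a][source]: v for a, v in where.items()},
        }
    return local_queries


def execute(source, query):
    """Step 2: run a local query and return rows keyed by global names."""
    rows = []
    for row in SOURCES[source]:
        if all(row.get(attr) == value for attr, value in query["where"].items()):
            rows.append({g: row[l] for l, g in query["select"].items()})
    return rows


def fuse(answers, key):
    """Step 3: merge local answers that refer to the same object."""
    merged = {}
    for rows in answers:
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


def answer(select, where, key="name"):
    """Run the three steps; the key attribute must be part of `select`."""
    local_queries = rewrite(select, where)
    return fuse([execute(s, q) for s, q in local_queries.items()], key)


result = answer(select=["name", "trait", "chromosome"], where={"trait": "FHB"})
```

The user states the query once, in terms of global attributes; which sources are contacted and how their attributes are named stays hidden behind `rewrite`.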

Example

“Find information about wheat QTLs that express resistance to the Fusarium fungus”

Conclusions

• The MOMIS system allows a straightforward creation of a Global Virtual Schema to integrate data from the CEREALAB research project with data coming from the databases Gramene, Graingenes and GRIN

• The integration process provides a single interface for the 3 sources according to a common ontology

• Querying the 3 sources through a GUI is completely transparent and easy for the user

• A single, unified answer is obtained