Databases, Ontologies and Text mining Session Introduction Part 2 Carole Goble, University of Manchester, UK Dietrich Rebholz-Schuhmann, EBI, UK Philip Bourne, SDSC/UCSD, USA [email protected]


Databases, Ontologies and Text mining

Session Introduction, Part 2

Carole Goble, University of Manchester, UK
Dietrich Rebholz-Schuhmann, EBI, UK
Philip Bourne, SDSC/UCSD, USA
[email protected]


Resources in Bioinformatics

[Figure: diagram with labels Bioinformatics, Databases, Ontologies, Applications and Mining, UniProt, LocusLink, The Gene Ontology, Text mining, Knowledge mining]


Resources in Bioinformatics

[Figure: diagram with labels Bioinformatics, Databases, UniProt, LocusLink]


What perspective do I bring?


Preface

• A review of the state and needs of the field from the perspective of a user of biological databases….

… the p53 core domain structure consists of a β sandwich that serves as a scaffold for two large loops and a loop-sheet-helix motif ...
(Science Vol. 265, p. 346)

Corresponding structure from the PDB: 1TSR

Oops!

β sandwich? Where? Large loop? Which one??

Loop-sheet-helix???


Preface

• A review of the state and needs of the field from the perspective of a developer of biological databases….


What are the current biological databases and what does this tell us?


Large Growth in the Number of Biological Databases

[Figure: NAR Database Issue; x-axis: Year (1996 to 2004); y-axis: Number of Entries (0 to 600)]


Resources are Becoming More Diverse

[Figure: NAR 2004, division by resource type. Database types: Nucleotide Sequence, RNA Sequence, Protein Sequence, Structure, Genome (non-human), Pathways, Genome (human), Disease, Gene Expression, Other]


NAR 2004 – A Closer Look

[Figure: chart legend, Database Types: Nucleotide Sequence, RNA Sequence, Protein Sequence, Structure, Genome (non-human), Pathways, Genome (human), Disease, Gene Expression, Other]

• Genome scale databases have proliferated

• Traditional sequence databases are now a small part

• Databases around new specific data types are emerging

• Pathway and disease orientated databases are emerging


The Future - ISMB04 Poster Distribution

[Figure: ISMB04 poster distribution by database type: Nucleotide Sequence, RNA Sequence, Protein Sequence, Structure, Genome (non-human), Pathways, Genome (human), Disease, Gene Expression, Other]


What Does ISMB04 Tell Us About New Biological Databases?

• Microarray data resources are hot

• Genotypic – phenotypic resources are emerging

• Surprisingly pathway resources are not growing fast

• Disease and species based resources are increasing – notably plants

• Human genome related resources are increasing


What About Data in These Databases?


Data are Becoming More Plentiful and More Complex


Data are Becoming More Redundant

Note: Redundancy at 30% Sequence Identity
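To make the 30% sequence identity criterion concrete, here is a minimal sketch of a greedy redundancy filter. It is an illustration only, not the procedure behind the slide: it assumes the sequences are already aligned to equal length with no gaps, and the names `percent_identity` and `nonredundant` are hypothetical. Real pipelines would compute identity from proper alignments.

```python
def percent_identity(a, b):
    """Percent of matching positions between two pre-aligned sequences.

    Assumption: both sequences are non-empty and have the same length.
    """
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def nonredundant(sequences, threshold=30.0):
    """Greedily keep a sequence only if it is below the identity
    threshold against every representative kept so far."""
    representatives = []
    for seq in sequences:
        if all(percent_identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical toy sequences, for illustration only.
toy = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "AACCCCCCCC"]
kept = nonredundant(toy)
print(len(toy), "sequences,", len(kept), "non-redundant at 30% identity")
```

The result depends on input order, because the first member of each similar group becomes its representative.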


So the amount and complexity of data are increasing across biological scales – what are the challenges?


A Major Challenge

12:00

We suffer from the “high noon syndrome”

Those who can gain and contribute most to biological databases are frequently NOT the users

We need to lower the cost:benefit ratio


How Do We Lower this Barrier?

• Better support of complex data types e.g., networks, images, graphs

• Associated optimized query languages

• Associated ontologies

• Better handling of uncertainty and inconsistency

• More and automated data curation

• Large scale data integration



How Do We Lower this Barrier?

• Support of data provenance

• Support for rapid data and associated schema evolution

• Support for temporal data

• Better integration of data and methods

• Usability engineering


We need more work in these other areas


A Note on Data Provenance


Further Reading

• Jagadish and Olken (2003). Data Management for Life Sciences Research. Omics 7(1), 131-137. http://www.lbl.gov/~olken/wmdbio

• Maojo and Kulikowski (2003). Bioinformatics and Medical Informatics – Collaborations on the Road to Genomic Medicine? J. of AMIA, 515-522.


GeneXPress: A Visualization and Statistical Analysis Tool for Gene Expression and Sequence Data

Segal, Kaushal, Yelensky, Pham, Regev, Koller, Friedman

[Figure: diagram with labels Data, Query & Analysis, Biological Results, Curation, Usability, Integration]

• Assign biological meaning to gene expression data through post-processing and visualization


Filtering Erroneous Protein Annotation

Wieser, Kretschmann and Apweiler

[Figure: diagram with labels Data, Query & Analysis, Biological Results, Curation, Usability, Integration]

• Automated detection of annotation errors using a decision tree approach based upon the C4.5 data mining algorithm
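The bullet above names C4.5 as the basis of the decision tree approach. As a rough illustration of the split criterion C4.5 is known for (gain ratio), and not the authors' implementation, here is a minimal sketch. The attribute names (`source`, `has_domain`), the labels and the toy rows are all hypothetical; pruning, continuous attributes and missing values are left out.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(rows, labels, attribute):
    """Information gain of splitting on `attribute`, divided by the
    entropy of the split itself (the C4.5-style gain ratio)."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy annotations: each row describes one annotated entry.
rows = [
    {"source": "automatic", "has_domain": "yes"},
    {"source": "automatic", "has_domain": "no"},
    {"source": "manual", "has_domain": "yes"},
    {"source": "manual", "has_domain": "no"},
]
labels = ["erroneous", "erroneous", "correct", "correct"]

# Pick the attribute with the highest gain ratio as the root split.
best = max(["source", "has_domain"], key=lambda a: gain_ratio(rows, labels, a))
print("best attribute to split on:", best)
```

In this toy set `source` separates the two labels perfectly, so it is chosen as the first split; a full tree builder would recurse on each partition.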


Selecting Biomedical Data Sources According to User Preferences

Cohen-Boulakia, Lair, Stransky, Graziani, Radvanyi, Barillot and Froidevaux

[Figure: diagram with labels Data, Query & Analysis, Biological Results, Curation, Usability, Integration]

• Understand the characteristics of biological data

• Present a selection of resources relevant to a user query

• Framework for the multiple parametric analysis of cancer


Integration of Biological Data from Web Resources: Management of Multiple Answers through Metadata Retrieval

Devignes, Smail

[Figure: diagram with labels Data, Query & Analysis, Biological Results, Curation, Usability, Integration]

• Same question – different answers from different resources – How can this be understood?

• Semantic integration based on domain ontologies


Criticality-based Task Composition in Distributed Bioinformatics Systems

Karasavvas, Baldock, Burger

[Figure: diagram with labels Data, Query & Analysis, Biological Results, Curation, Usability, Integration]

• Task composition in workflow systems requires decision support

• Provision of data carrying provenance information provides that support


ENJOY !!