database issues in nutritional genomics

Database Issues in Nutritional Genomics. Tony Travis & Peter Gray, Rowett Research Institute & University of Aberdeen, Jan 2005.


Post on 14-Jan-2016



TRANSCRIPT

Page 1: Database Issues in Nutritional Genomics

Database Issues in Nutritional Genomics

Tony Travis & Peter Gray

Rowett Research Institute & University of Aberdeen

Jan 2005

NuGO

Page 2: Database Issues in Nutritional Genomics

Un Oslo

Un Munich

Un Florence

Un Balearic Illes

Un Cork

Trinity

Un. Ulster

Rowett

Un Newcastle

Un Reading

IFR

DiFE

Un Krakow

Inserm Marseille

TNO

Un Wageningen

Un Maastricht

EBI

NuGO

Un Lund

Rikilt

Rivm

Page 3: Database Issues in Nutritional Genomics

Utopian view

• Share data freely
• Everyone benefits
• Ideas develop
• Science prospers

Page 4: Database Issues in Nutritional Genomics

Big pharma disagree!

• Sell data commercially
• Big pharma benefits
• Ideas are exploited
• Science is a business

Page 5: Database Issues in Nutritional Genomics

Scientists are confused…

• Intellectual freedom?
– Curiosity driven science
– Poor funding
• Intellectual property?
– Commercially driven science
– Good funding

Page 6: Database Issues in Nutritional Genomics

Preserving intellectual property

• Autonomy
– Scientist or institution controls who their data is disclosed to
– Control data sharing by collaborators who share their IP
– Needs federated solution
• Security
– Prevent unauthorised access to data
– Prevent unauthorised use of data
– Maintain integrity and provenance of data

Page 7: Database Issues in Nutritional Genomics

Typical NutriGenomics Use Case

• Example of pragmatic solution – DNA microarray work at RRI
• Autonomy
– Data held locally on PC spreadsheets
– Completely under control of investigator
• Collaborators
– Each creates a spreadsheet of local results
– All collaborators exchange spreadsheets

Page 8: Database Issues in Nutritional Genomics

Spreadsheet microarray data

Page 9: Database Issues in Nutritional Genomics

Distribution of one spreadsheet

[Diagram: four collaborators, A, B, C and D]

Page 10: Database Issues in Nutritional Genomics

Exchange of all spreadsheets

[Diagram: four collaborators, A, B, C and D]

Page 11: Database Issues in Nutritional Genomics

Manual replication of database

• Advantages
– Simple peer-to-peer data transfer via email
– Each collaborator has entire database locally
– Local analysis tools are readily available
– Complete control of IP within collaboration
• Disadvantages
– N(N-1) solution
– Does not scale well
– Each collaborator must merge data into local database replica
– No control over data integrity or provenance
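The "N(N-1)" disadvantage can be made concrete with a little arithmetic. The sketch below assumes that every one of N collaborators emails their spreadsheet to each of the other N-1; the function name is illustrative and not from the slides.

```python
def transfers_needed(n_collaborators: int) -> int:
    """One-way spreadsheet transfers needed when each of N collaborators
    sends their data to every other collaborator: N * (N - 1)."""
    if n_collaborators < 0:
        raise ValueError("number of collaborators cannot be negative")
    return n_collaborators * (n_collaborators - 1)


# Four collaborators (A, B, C, D) versus larger collaborations.
for n in (4, 10, 20):
    print(f"{n} collaborators -> {transfers_needed(n)} transfers")
```

The count grows quadratically, which is why the manual approach does not scale well beyond a handful of sites.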

Page 12: Database Issues in Nutritional Genomics

Spreadsheet Replicated Data Model

• Distributed
– Data originates at each collaborator’s site
• Replicated
– Copy of the entire database at each site
• Manually updated
– Data and corrections are pushed from each collaborator to all others by emailing Excel spreadsheets of expression data, which are then merged into a single spreadsheet

Page 13: Database Issues in Nutritional Genomics

Local analysis tools: maxd

• Microarray Bioinformatics Group, University of Manchester (UK)
• Java-based
• maxdView
– Visualise and analyse gene expression data.
• maxdLoad2
– Store and curate gene expression data to MIAME standards
• Export in MAGE-ML format for submission to ArrayExpress.

Page 14: Database Issues in Nutritional Genomics

Import spreadsheet data into maxd

Page 15: Database Issues in Nutritional Genomics

Analyse expression profiles

• 10,000 genes
• Four experiments by one collaborator
• Normalised
• Clustered
• Comparison of gene expression profiles between experiments

Page 16: Database Issues in Nutritional Genomics

Upgrade spreadsheet solution

• maxdLoad2
– Replace spreadsheets
– Use MIAME standard
– JDBC compliant interface
– SQL92 (MySQL, Postgres)

Page 17: Database Issues in Nutritional Genomics

Candidate Mediator middleware

• maxd
– Designed for use with single database
• P/FDM
– Integration of heterogeneous data sources
– Federated union/join of relations
• BioMart
– MartShell scripting language
– Federate database instances

Page 18: Database Issues in Nutritional Genomics

Example federated DB

Page 19: Database Issues in Nutritional Genomics

MartShell

• Command line (text mode) user interface to BioMart that can be used by programs

• Mart Query Language (MQL)

• Queries can be executed in ‘batch’ mode using stored procedures in MQL scripts

Page 20: Database Issues in Nutritional Genomics

BioArray Software Environment

• BASE is a comprehensive database server to manage massive amounts of data generated by microarray analysis

• Lund University + Oklahoma University

• Data can be analysed using a web-based GUI to server-side PHP scripts or data can be extracted from the BASE database by applications such as Genespring

Page 21: Database Issues in Nutritional Genomics

Querying a Federated DB

There are two kinds of distributed query that you can send out to the federation:

• Federated Join - like adding extra columns with cross-referenced information on the same object or related objects.

• Federated Union – like adding extra rows with the same column headings – the same kinds of experiments but done at different sites.
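The difference between the two query kinds can be sketched with plain Python lists of rows. This is only an illustration: the `weight_g` column, the function names and the in-memory "sites" are assumptions made for the example, not part of any mediator described in the slides.

```python
# Hypothetical per-site relations: each row is a dict keyed by column name.
lab1_expression = [
    {"rat": "Rat 1", "GeneA": 1.488212},
    {"rat": "Rat 2", "GeneA": 0.374015},
]
lab2_expression = [
    {"rat": "Rat 4", "GeneA": 1.032028},
]
lab1_physiology = [  # made-up physiological data about the same animals
    {"rat": "Rat 1", "weight_g": 310},
    {"rat": "Rat 2", "weight_g": 295},
]


def federated_union(*relations):
    """Extra rows: concatenate relations that share the same column headings."""
    rows = []
    for relation in relations:
        for row in relation:
            if rows and set(row) != set(rows[0]):
                raise ValueError("union needs identical column headings")
            rows.append(dict(row))
    return rows


def federated_join(left, right, key):
    """Extra columns: extend each left row with the matching right row."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


all_expression = federated_union(lab1_expression, lab2_expression)
lab1_combined = federated_join(lab1_expression, lab1_physiology, key="rat")
print(len(all_expression), "rows after union")
print(lab1_combined[0])
```

The union only works because both labs use the same column headings, and the join only works because both relations identify the animal in the same way.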

Page 22: Database Issues in Nutritional Genomics

Comparing expression profiles (e.g. looking for co-regulation)

                GeneA     GeneB     GeneC     GeneD     GeneE     Gene…
Lab 1   Rat 1   1.488212  1.152023  6.957593  1.678806  2.172907  …
        Rat 2   0.374015  1.746107  4.191758  2.666562  3.184208  …
        Rat 3   0.642141  1.260019  1.079844  1.651717  1.359549  …
Lab 2   Rat 4   1.032028  1.320651  5.806003  1.389428  3.625239  …
        Rat 5   1.482212  1.157023  6.857593  1.678806  2.142907  …
        Rat 6   1.291634  1.06932   1.061052  2.083518  1.146157  …
Lab 3   Rat 7   1.808716  2.388491  1.47649   0.412969  2.225646  …
        Rat 8   1.217205  1.114725  1.218257  2.560339  2.202825  …
        Rat...  …         …         …         …         …
Lab...
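One simple way to look for co-regulation is to correlate the expression of two genes across all animals, pooling the rows from every lab. The sketch below uses the GeneA, GeneB and GeneC values for Rats 1 to 8 from the table; choosing Pearson correlation is an assumption of this example, since the slides do not name a method.

```python
import math
from itertools import combinations

# Values copied from the table: Rats 1-8, pooled across Labs 1-3.
profiles = {
    "GeneA": [1.488212, 0.374015, 0.642141, 1.032028,
              1.482212, 1.291634, 1.808716, 1.217205],
    "GeneB": [1.152023, 1.746107, 1.260019, 1.320651,
              1.157023, 1.06932, 2.388491, 1.114725],
    "GeneC": [6.957593, 4.191758, 1.079844, 5.806003,
              6.857593, 1.061052, 1.47649, 1.218257],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Compare every pair of genes across all animals.
for gene_1, gene_2 in combinations(profiles, 2):
    r = pearson(profiles[gene_1], profiles[gene_2])
    print(f"{gene_1} vs {gene_2}: r = {r:+.3f}")
```

With only eight animals the coefficients are indicative at best; the point is that the comparison needs rows from all sites in one table.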

Page 23: Database Issues in Nutritional Genomics

Conditions for making a Federated DB work

Needs Common Ontology for data of same type. BEWARE measurements made in different units, or using a very different experimental procedure, or qualitative measurements such as "large" or "medium".
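Differences in units are the easy case, because they can be reconciled by a rule. A minimal sketch, assuming a hypothetical body-weight measurement reported in different mass units at different sites; the conversion table and function name are made up for illustration.

```python
# Conversion factors to a common unit (grams).
TO_GRAMS = {"mg": 0.001, "g": 1.0, "kg": 1000.0}


def to_grams(value, unit):
    """Convert a numeric mass to grams, refusing values that need judgement."""
    if not isinstance(value, (int, float)):
        # Qualitative values such as "large" cannot be mapped by an equation.
        raise ValueError(f"cannot convert qualitative value {value!r}")
    if unit not in TO_GRAMS:
        raise ValueError(f"unknown unit {unit!r}")
    return value * TO_GRAMS[unit]


print(to_grams(0.31, "kg"))  # a site reporting in kilograms
print(to_grams(295, "g"))    # a site reporting in grams
```

A qualitative value such as "large" raises an error instead of being silently guessed, which reflects the warning on the slide.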

Page 24: Database Issues in Nutritional Genomics

Conditions for making a Federated DB work

Need Common Unique Identifiers: if no property allows you to tell that one entity instance is the same as another, then integration is UNSAFE!

(Note - it might be OK for say 95 percent of identifiers...)
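Before joining two sites' data, it is cheap to check how many identifiers they actually share. The sketch below is illustrative only: the identifier strings are invented, and treating the 95 percent figure as a pass/fail threshold is an assumption of this example.

```python
def shared_identifier_fraction(ids_a, ids_b):
    """Fraction of the identifiers in ids_a that also occur in ids_b."""
    a, b = set(ids_a), set(ids_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical gene identifiers used at two sites.
site_1_ids = ["gene_001", "gene_002", "gene_003", "gene_004"]
site_2_ids = ["gene_001", "gene_002", "gene_003", "probe_17"]

fraction = shared_identifier_fraction(site_1_ids, site_2_ids)
print(f"{fraction:.0%} of site 1 identifiers are known at site 2")
if fraction < 0.95:  # assumed threshold, taken from the note on the slide
    print("Too few common identifiers: a federated join would be unsafe")
```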

Page 25: Database Issues in Nutritional Genomics

Conditions for making a Federated DB work

Mechanisation of Value mapping:
• if data values can only be compared or made compatible with others using the judgement of an experienced scientist, then one must use a Warehouse (as in early PDB), otherwise
• if you can mechanise it using rules or equations then it can be done by a view,
• or by a mediator accessing the Federation

Page 26: Database Issues in Nutritional Genomics

Conditions for making a Federated DB work

Need Standard Interchange Formats:
• Formats such as mmCIF helped reduce human intervention in PDB. The widely used MIAME format may do the same for MicroArray Data.

• However such data is much harder to integrate as it may be measured under different conditions with different technology.

Page 27: Database Issues in Nutritional Genomics

Difficulties of Federated Approach

• Reliability - Sites must be available continuously, and not crash too often;

• Support costs - must be proof against Virus attacks, etc., and have people able to bring them back up again promptly

Page 28: Database Issues in Nutritional Genomics

Difficulties of Federated Approach

• Compatibility - must provide a common interface. Sites may be able to share development of some downloadable server software (like Java WebStart) that responds to SOAP protocol messages and commands, is configurable through web forms, and keeps logs of errors.

Page 29: Database Issues in Nutritional Genomics

Difficulties of Federated Approach

• Performance: Warehouses will provide better performance for data mining programs and other programs with a high hit rate.

• Federated systems compete well on more focused queries which allow the use of indexes in remote systems.

Page 30: Database Issues in Nutritional Genomics

Having it Both Ways:
• A Federated Solution can include some sites that are adopting Warehouse technology to collect and vet large volumes of data of a particular kind.
• The NUGO data model and ontologies are bound to change a lot in ways we cannot foresee. Thus it makes sense to be flexible to start, allowing site autonomy, and to delay committing to large warehouses until we understand more about the data model and IPR issues.

Page 31: Database Issues in Nutritional Genomics

Discovering the Model

Birney & Clamp (2004) say – "the true biological interpretation of data stored in a database will change over time, and discovering new relationships between aspects of the data is an important part of the motivation for storing it..."

Page 32: Database Issues in Nutritional Genomics
Page 33: Database Issues in Nutritional Genomics

Conclusion (1) - Spreadsheets

• Spreadsheets are easy and popular
• Integrating Spreadsheets manually is time-wasting and can easily lead to errors and wrong conclusions
• Scientists need the discipline of a shared Data Model and the automation of data transfer and conversion, usually provided by a Mediator

Page 34: Database Issues in Nutritional Genomics

Conclusion (2) – Shared Data Model

• Agreement on a shared Ontology is mainly a problem of agreeing Standards for names, units, and specialised types.

• Agreeing a shared Data Model is more subtle. It may need experimentation in advance of a standard.

• The Data Model, based on the Entity-Relationship Model with SubTypes, must be able to evolve: it cannot be fixed in stone and must cope with the unforeseen.

Page 35: Database Issues in Nutritional Genomics

Conclusion (2) – Shared Data Model

• The Data Model must be at Conceptual Level - independent of Storage Technique - arrays, ASN.1, XML, tables etc... Otherwise agreeing a Shared Model becomes too hard!

• The Data Model must provide External Views both to restrict access and to provide a consistent API to External Applications; these may be Spreadsheets or Statistical Packages or MaxD or Genespring etc...
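An external view can be as simple as a projection that exposes only the columns a given client is allowed to see. The following is a minimal sketch under that assumption; the column names and the `external_view` helper are hypothetical.

```python
# Hypothetical rows held in one site's local database.
rows = [
    {"rat": "Rat 1", "GeneA": 1.488212, "investigator_notes": "unpublished"},
    {"rat": "Rat 2", "GeneA": 0.374015, "investigator_notes": "repeat run"},
]


def external_view(rows, allowed_columns):
    """Return copies of the rows restricted to the permitted columns."""
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
    ]


# Collaborators see expression values but not the investigator's notes.
collaborator_view = external_view(rows, ["rat", "GeneA"])
print(collaborator_view)
```

Every external application is handed rows of the same shape, whatever the underlying storage technique happens to be.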

Page 36: Database Issues in Nutritional Genomics

Conclusion (3) – Federating Microarray Data

• Usually, a federation is based on a federated Join, through common identifiers, because irrelevant joins can be left out, to speed up the query.

• Federated Joins suit integrating other types of data with Microarray data, e.g. physiological, epidemiological data

• This is easily done, on the fly; it allows us to evolve the data model and experiment with it without making changes to a centralised warehouse. Once the data model is more stable, parts of it can be stored in a warehouse.

Page 37: Database Issues in Nutritional Genomics

Conclusion (3) – Federating Microarray Data

• Queries that want to compare Gene Expression Profiles across many Experiments need a federated Union of data from different experimenters.

• Comparing one profile against those from many experimental sites could be done in parallel. Trusted methods could work with an encrypted profile to keep it confidential.
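The parallel comparison can be sketched with a thread pool, where each worker asks one site for its closest profile. In this sketch the "sites" are in-memory dictionaries holding the Lab 2 and Lab 3 rows of the earlier table; a real federation would query remote databases. Euclidean distance is an assumed similarity measure, and encryption of the query profile is not shown.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for remote sites (assumption): profiles over GeneA..GeneE.
SITES = {
    "Lab 2": {
        "Rat 4": [1.032028, 1.320651, 5.806003, 1.389428, 3.625239],
        "Rat 5": [1.482212, 1.157023, 6.857593, 1.678806, 2.142907],
        "Rat 6": [1.291634, 1.06932, 1.061052, 2.083518, 1.146157],
    },
    "Lab 3": {
        "Rat 7": [1.808716, 2.388491, 1.47649, 0.412969, 2.225646],
        "Rat 8": [1.217205, 1.114725, 1.218257, 2.560339, 2.202825],
    },
}


def closest_at_site(site_name, query):
    """Return (site, sample, distance) for the site's profile nearest to query."""
    sample, profile = min(
        SITES[site_name].items(),
        key=lambda item: math.dist(query, item[1]),
    )
    return site_name, sample, math.dist(query, profile)


def compare_across_sites(query):
    """Send the same query profile to every site concurrently."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda name: closest_at_site(name, query), SITES))


query_profile = [1.488212, 1.152023, 6.957593, 1.678806, 2.172907]  # Rat 1, Lab 1
results = compare_across_sites(query_profile)
for site, sample, distance in results:
    print(f"{site}: closest is {sample} (distance {distance:.3f})")
```

Each site returns only its best match, so the raw data never has to leave the site that owns it.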

Page 38: Database Issues in Nutritional Genomics

Conclusion (4) – IPR and Federation

• Scientists want to retain their autonomy and right to recognised authorship of the data, otherwise they may not share it!

• If Database Right (EU proposal) becomes established, scientists may wish to keep data in their own DB in order to take advantage of it. Thus we may need to make more use of federated techniques to bring such data together.

• Revenue-Raising Potential may become important (iTunes for example).