Data management and curation: the other side of bioinformatics Susanna-Assunta Sansone, PhD Principal Investigator and Team Leader, University of Oxford e-Research Centre, Oxford, UK http://uk.linkedin.com/in/sasansone Bioinformatics for Omics Sciences (B4OS), CNR Naples, 25-17 Sep 2012 http://www.slideshare.net/SusannaSansone/B4OS-2012

Page 1: B4OS-2012

Data management and curation:

the other side of bioinformatics

Susanna-Assunta Sansone, PhD Principal Investigator and Team Leader,

University of Oxford e-Research Centre, Oxford, UK

http://uk.linkedin.com/in/sasansone

Bioinformatics for Omics Sciences (B4OS), CNR Naples, 25-17 Sep 2012

http://www.slideshare.net/SusannaSansone/B4OS-2012

Page 2: B4OS-2012

Oxford e-Research Centre

Page 3: B4OS-2012

Oxford e-Research Centre

Page 4: B4OS-2012

Providing research computing, high-performance computing

Integrating with national and international infrastructure

Supporting leading edge facilities through education and training

Oxford e-Research Centre

Page 5: B4OS-2012

Oxford e-Research Centre

Collaborating with European and wider international groups in, e.g.:

•  energy
•  radio astronomy
•  biological data federation
•  life sciences simulation
•  biodiversity
•  computational chemistry
•  neuroscience
•  digital humanities tools
•  digital music analysis

Research in
•  computation
•  data infrastructure and analysis
•  visualisation

Page 6: B4OS-2012

tox/pharma  

env  

health  

agro  

My team’s activities and groups we work with

data management, biocuration, development of software, databases and community-driven standards and ontologies

Page 7: B4OS-2012

http://www.flickr.com/photos/12308429@N03/4957994485/ CC BY

Page 8: B4OS-2012

Today:

“The buzz around reproducible bioscience data -

the policies, the communities and the standards”

Thursday:

“The reality from the buzz: how to deliver

reproducible bioscience data”

Page 9: B4OS-2012

Harmonize collection across sites

Find matching studies

Data dissemination

Long-term data stewardship

Preserve institutional / corporate memory

Page 10: B4OS-2012

Utilize public data

Identify suitable data

Retrieve

Curate and harmonize

Re-analyze

Page 11: B4OS-2012

Address reproducibility /

reuse of public data

Page 12: B4OS-2012

Address reproducibility /

reuse of public data

Page 13: B4OS-2012

Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009) doi:10.1038/ng.295

Address reproducibility /

reuse of public data

Page 14: B4OS-2012

Address reproducibility /

reuse of public data

Page 15: B4OS-2012

Address reproducibility /

reuse of public data

Page 16: B4OS-2012

Address reproducibility /

reuse of public data

Page 17: B4OS-2012

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

Growing, worldwide movement for reproducible research

“Publicly-funded research data are a public good, produced in the public interest”

“Publicly-funded research data should be openly available to the maximum extent possible”

Shared, annotated research data and methods offer new discovery opportunities and prevent unnecessary repetition of work.

Improved data sharing underpins science of the future

Page 18: B4OS-2012

http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY

Page 19: B4OS-2012

Reproducible & Reusable

Bioscience Research

Page 20: B4OS-2012

Reproducible & Reusable

Bioscience Research

Well-annotated & Structured Data

reasoning

analysis

exchange

integration

visualization

browsing retrieval

Page 21: B4OS-2012

Reproducible & Reusable

Bioscience Research

Well-annotated & Structured Data

reasoning

analysis

exchange

integration

visualization

browsing retrieval

Community Standards

Software Tools

Page 22: B4OS-2012

Source of the figure: EBI website

§  Is interdisciplinary and integrative in character
•  need to deal with new and existing datasets
•  deal with a variety of data types

§  ‘How the organism works’ is the focus
•  Twenty years ago data was the center

Experimental and

computational data

Publications

Today’s bioscience research

Page 23: B4OS-2012

Example from the toxicogenomics domain

Study looking at the effect of a compound inducing liver damage by characterizing/measuring

- the metabolic profile by MS and NMR

- protein expression in liver by MS

- gene expression by DNA microarray

- conducting genetic and phenotypic analysis

Information contributing to the construction and validation of systems biology models

Page 24: B4OS-2012

Example of experiments by InnoMed PredTox, an FP6 public-private consortium

Page 25: B4OS-2012

§  Capture all salient features of the experimental workflow

§  Make annotation explicit and discoverable

§  Structure the descriptions for consistency, tracking independent variables and dependent variables, using cross-references and resolvable identifiers

Structured description of datasets
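
As a rough illustration of the points on this slide, a structured description could be sketched as follows. This is only a minimal sketch: the class names, field names and the example.org identifiers are assumptions made for illustration and do not come from the talk or from any of the standards it mentions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variable:
    name: str        # human-readable label
    term_id: str     # identifier that should resolve to a definition
    unit: str = ""   # optional unit of measurement


@dataclass
class StudyDescription:
    title: str
    independent: List[Variable] = field(default_factory=list)
    dependent: List[Variable] = field(default_factory=list)

    def unresolved(self) -> List[str]:
        """Names of variables whose identifier is not an http(s) URI."""
        return [
            v.name
            for v in self.independent + self.dependent
            if not v.term_id.startswith(("http://", "https://"))
        ]


# Hypothetical example values; example.org is a placeholder namespace.
study = StudyDescription(
    title="Compound-induced liver damage",
    independent=[
        Variable("dose", "http://example.org/terms/dose", "mg/kg"),
        Variable("time point", "time"),  # free text, not resolvable
    ],
    dependent=[
        Variable("gene expression", "http://example.org/terms/gene-expression"),
    ],
)

print(study.unresolved())
```

Keeping independent and dependent variables in separate, typed lists makes the annotation explicit, and a simple check such as `unresolved()` flags entries that still rely on free text.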

Page 26: B4OS-2012

§  We must strike a balance between
•  depth and breadth of information; and
•  sufficient information required to reuse the data

Not too much, not too little, just ‘right’

Page 27: B4OS-2012

Information intensive experiments

Page 28: B4OS-2012

To make the experiments comprehensible and reusable,

underpinning future investigations, we need

common ways to report and share the experimental details and the associated data.

Consistent reporting will have a positive and long-lasting impact

on the value of collective scientific outputs.

Information intensive experiments

Page 29: B4OS-2012

§ The challenges we face

•  Large in volume: lots of data types and metadata!
•  Lots of free text descriptions: hard to mine, subject to mistakes!
•  Babel of terminologies: lack of definitions, hard to map!
•  Heterogeneous file formats: software lock-in!

§ Need for reporting standards
•  Minimal reporting descriptors
   - Report the same ‘core essentials’
•  Controlled vocabularies or ontologies
   - Use the same word and mean the same thing
•  Common exchange formats
   - Make tools interoperable, allow data exchange and integration

Common ways to report and share
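
The three kinds of reporting standard listed on this slide can be pictured with a small sketch. Everything named here is a hypothetical stand-in: the required fields are not a real minimum-information checklist, the vocabulary is not a real ontology, and JSON is used only as a convenient example of a common exchange format.

```python
import json

# Hypothetical 'core essentials' (stand-in for a minimal reporting checklist).
REQUIRED_FIELDS = {"organism", "assay_type", "protocol"}

# Hypothetical controlled vocabulary (stand-in for ontology terms).
ASSAY_VOCABULARY = {"dna microarray", "mass spectrometry", "nmr spectroscopy"}


def check_report(report: dict) -> list:
    """Return a list of problems found in an experiment report."""
    problems = []
    # 1. Minimal reporting descriptors: are the core fields present?
    for name in sorted(REQUIRED_FIELDS - set(report)):
        problems.append(f"missing field: {name}")
    # 2. Controlled vocabulary: is the assay named with an agreed term?
    assay = report.get("assay_type", "").strip().lower()
    if assay and assay not in ASSAY_VOCABULARY:
        problems.append(f"unknown assay_type: {assay}")
    return problems


report = {"organism": "Rattus norvegicus", "assay_type": "Microarray"}
problems = check_report(report)

# 3. Common exchange format: serialise so that other tools can read it.
exchange = json.dumps(report, sort_keys=True)

print(problems)
print(exchange)
```

In this sketch the free-text value "Microarray" is rejected because it is not the agreed term, which is the kind of ambiguity a controlled vocabulary is meant to remove.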

Page 30: B4OS-2012

§  Describe and communicate the information to others, in an unambiguous manner

§  To unlock the value in the data
•  Compare, query and evaluate data
   - Facilitate scientific validation of the findings
•  Understand variability within/between different technologies and protocols
   - Facilitate technical validation
   - Enable optimization of the experimental designs
   - Identify critical checkpoints and develop quality metrics

§  To define submission and/or publication requirements
•  Journals
•  Databases

§  To ensure data integrity, reproducibility and (re)use

Reporting standards – the benefits

Page 31: B4OS-2012

Genome annotation www.geneontology.org

Functional Genomics Data Society (FGED)

www.fged.org

HUPO- Proteomics Standards Initiative (PSI)

http://www.psidev.info

Cheminformatics www.ebi.ac.uk/chebi

Pathways www.biopax.org

Systems modelling standards

www.sbml.org

Metabolomics Standards Initiative (MSI) http://www.metabolomicssociety.org

Genomic Standards Consortium (GSC)

gensc.org

Escalating number of standardization efforts in bioscience, e.g.:

Enzymology data standards

www.strenda.org

Page 32: B4OS-2012

Different community, different norms and standards, e.g.:

report the same core, essential information

use the same word and refer to the same ‘thing’

allow data to flow from one system to another

Page 33: B4OS-2012

Page 34: B4OS-2012

Challenges: lack of coordination, fragmentation and uneven coverage

Page 35: B4OS-2012

Is this ‘general mobilization’ good or bad?

§  Difference in structures and processes:
•  organization types (open, closed to members, society, WG…)
•  standards development (how to design, develop, evaluate, maintain…)
•  adoption, uptake, outreach (link to journals, funders, commercial sector…)
•  funds (sponsors, memberships, grants, volunteering…)

Page 36: B4OS-2012

§  Fragmentation of the standards is a major issue
•  Being focused on particular communities’ interests, be they individual technologies or biological/biomedical disciplines, leads to duplication of effort and, more seriously, the development of (largely arbitrarily) different standards
•  This severely hinders the interoperability of databases and tools and ultimately the integration of datasets

Is this ‘general mobilization’ good or bad?

Page 37: B4OS-2012

MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, MIQE, MISFISHIE…, REMARK, CONSORT

MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML…, GELML, ISA-Tab, CML, MITAB

AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO…, TEDDY, PRO, XAO, DO, VO

Growing number of reporting standards

Page 38: B4OS-2012

Growing number of reporting standards

+ 130

Estimated

+ 150

Source: MIBBI, EQUATOR

+ 303

Source: BioPortal

Databases, annotation,

curation tools

Page 39: B4OS-2012

But how much do we know about these standards

Page 40: B4OS-2012

Which ones are mature enough for me to use or recommend?

I work on plants; are these just for biomedical applications?

What are the criteria to evaluate their status and value?

How can I get involved to propose extensions or modifications?

Which tools and databases implement which standards?

I use high-throughput sequencing technologies; which ones are applicable to me?

But how much do we know about these standards

Page 41: B4OS-2012

§  A bewildering array of standards is available, but

•  these are hard to find, at different levels of maturity; in

some areas duplications or gaps in coverage also exist

§  Standards are just a ‘means to an end’, therefore

•  we want to make them discoverable and accessible,

maximizing their use to assist the virtuous data cycle,

from generation to standardization through publication to

subsequent sharing and reuse

But how much do we know about these standards

Page 42: B4OS-2012

A catalogue to map the landscape of standards and the systems implementing them: Over 400 bio-standards (public and in curation)

Field*, Sansone* et al., Omics data sharing. Science 326, 234-36 (2009) doi:10.1126/science.1180598

Page 43: B4OS-2012

•  A coherent, curated and searchable catalogue of data sharing resources
•  Bioscience standards and associated data-sharing policies, publications, tools and databases
•  Assessment criteria for usability and popularity of standards
•  Relationships among standards
•  Encouragement for communication & interaction among groups
•  Promoting interoperability & informed decisions about standards

Page 44: B4OS-2012

Example of multi-assay study – how many ‘standards’ are applicable to this?

Page 45: B4OS-2012

Example of multi-assay study – how many ‘standards’ are applicable to this?

Page 46: B4OS-2012

Example of multi-assay study – how many ‘standards’ are applicable to this?

Page 47: B4OS-2012

Example of multi-assay study – how many ‘standards’ are applicable to this?

Page 48: B4OS-2012

Smith et al, 2007

Page 49: B4OS-2012

Smith et al, 2007

Taylor, Field, Sansone et al, 2008

Page 50: B4OS-2012

List of databases linked to standards; a collaboration with Database Issue

Page 51: B4OS-2012

List of databases linked to standards; a collaboration with Database Issue

Page 52: B4OS-2012

List of databases linked to standards; a collaboration with Database Issue

Page 53: B4OS-2012

The relationship among popular standard formats for pathway information. BioPAX and PSI-MI are designed for data exchange to and from databases and for pathway and network data integration. SBML and CellML are designed to support mathematical simulations of biological systems, and SBGN represents pathway diagrams.

CREDIT: Demir, et al., The BioPAX community standard for pathway data sharing, 2010.

Major challenge: define ‘relations’ among standards

Page 54: B4OS-2012
Page 55: B4OS-2012

This is not just a technical but also a social engineering challenge!

Page 56: B4OS-2012

Ownership of open standards can be problematic in broad, grassroots collaborations; it requires improved models to encourage maintenance of and contributions to these efforts, supporting their evolution

Page 57: B4OS-2012

The extensive ‘social engineering’ and community liaison needs to be managed

and funded; rewards and incentives need to be identified

for all contributors

Page 58: B4OS-2012

CC BY

http://www.flickr.com/photos/idiolector/289490834/

Page 59: B4OS-2012
Page 60: B4OS-2012

The cost of implementing a standards-supported data

sharing vision is as large as the number of stakeholders that must operate synchronously

Page 61: B4OS-2012

§ Several data preservation, management and sharing policies have emerged in response to increased funding for omics domains

§ Even if in general terms, standards are recognized as necessary ‘tools’ to unambiguously represent, describe and communicate research data

1. Funders actively developing data policies

Page 62: B4OS-2012
Page 63: B4OS-2012

§  “… lack of standardized data affects CDER’s review processes by curtailing a reviewer’s ability to perform integral tasks such as rapid acquisition, storage, analysis......efficient management of a portfolio of standards projects will require coordinated efforts and clear roles for multiple participants within/outside FDA”

2. Similar trend in the regulatory arena

Page 64: B4OS-2012
Page 65: B4OS-2012

§ Continue to support the development of open standards and tools •  to support sharing of sufficiently well annotated datasets •  to enable comprehensible, reusable, reproducible research

3. Publishers have become strong advocates

Page 66: B4OS-2012

….the rise of data-driven journals, e.g.:

partnering with:

Page 67: B4OS-2012
Page 68: B4OS-2012

The rise of data-driven journals, e.g.:

partnering with:

Page 69: B4OS-2012

§ R&D has invested heavily in procedures and tools that integrate external information with their own data to enhance the decision-making process

•  Now joining forces to streamline non-competitive elements of the life science workflow by the specification of common standards, business terms, relationships and processes

4. Similar trend in the commercial sector

Page 70: B4OS-2012

Big Life Science Company

Yesterday / Today / Tomorrow

Innovation model: Innovation inside / Searching for innovation / Heterogeneity of collaborations; part of the wider ecosystem

IT: Internal apps & data / Struggling with change, security and trust / Cloud, services

Data: Mostly inside / In and out / Distributed

Portfolio: Internally driven and owned / Partially shared / Shared portfolio

Credit to: Pistoia Alliance

Big Life Science

Company

Proprietary content provider

Public content provider

Academic group

Software vendor

CRO

Service provider

Regulatory authorities

....their information landscape is evolving

Page 71: B4OS-2012

•  Contribute to the reproducible research movement

•  Think about data management as a career path

•  Learn more about open community-standards

•  Get involved, e.g.:

Open Bioinformatics Foundation

Take home messages

Page 72: B4OS-2012

http://www.flickr.com/photos/jackofspades/4500411648/ CC BY

Data is not like a $ bill….

Page 73: B4OS-2012

http://www.flickr.com/photos/equinoxefr/2620239993/ CC BY

Your research and all (publicly funded) research should make an … impact

Page 74: B4OS-2012

http://www.flickr.com/photos/webhamster/2582189977/ CC BY

…..the biggest possible impact!

Page 75: B4OS-2012

Today:

“The buzz around reproducible bioscience data -

the policies, the communities and the standards”

Thursday:

“The reality from the buzz: how to deliver

reproducible bioscience data”

Page 76: B4OS-2012

Is it possible to achieve a common, structured

representation of diverse bioscience experiments that:

•  follows the appropriate community standards and

•  delivers richly-annotated datasets?

Page 77: B4OS-2012

Tim Berners-Lee’s 5-star deployment scheme for Linked Open Data

Page 78: B4OS-2012

Notes in Lab Books (information for humans)

Spreadsheets and Tables (the compromise)

Facts as RDF statements (information for machines)

Increasing level of structure
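
The step from a spreadsheet row to machine-readable statements can be sketched roughly as below. This is an illustrative sketch only: the example.org namespace, the column names and the sample values are assumptions, and the output is a simplified N-Triples-style rendering with no escaping of literal values. It is not the ISA tooling referenced on this slide.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace, for illustration only

# A tiny table, as it might be exported from a spreadsheet (made-up values).
TABLE = "sample,organism,compound\ns1,rat,compoundA\n"


def row_to_triples(row: dict, key: str = "sample") -> list:
    """Turn one table row into subject-predicate-object statements."""
    subject = f"<{BASE}sample/{row[key]}>"
    return [
        f'{subject} <{BASE}prop/{column}> "{value}" .'
        for column, value in row.items()
        if column != key
    ]


rows = list(csv.DictReader(io.StringIO(TABLE)))
triples = [triple for row in rows for triple in row_to_triples(row)]

for triple in triples:
    print(triple)
```

Each cell becomes one statement about the sample, so the column header that a human reads as context turns into an explicit predicate that software can query.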

www.biosharing.org

www.isacommons.org

TOWARDS INTEROPERABLE BIOSCIENCE DATA

Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo C, Forster M, Gaudet P, Gilbert J, Goble C, Griffin J, Jacob D, Kleinjans J, Harland L, Haug K, Hermjakob H, Sui S, Laederach A, Liang S, Marshall S, Merrill E, McGrath A, Reilly D, Roux M, Shamu C, Shang C, Steinbeck C, Trefethen A, Williams-Jones B, Wolstencroft K, Xenarios J, Hide W.

Feb 2012

www.isacommons.org

doi:10.1038/ng.1054

Page 79: B4OS-2012

References

1. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ; OBI Consortium, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone SA, Scheuermann RH, Shah N, Whetzel PL, Lewis S: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25(11):1251-1255 (2007)

2. Taylor CF*, Field D*, Sansone SA*, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, Brazma A, Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy NW, Hermjakob H, Julian RK Jr, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novère N, et al.: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 26(8):889-896 (2008)

3. Field D*, Sansone SA*, Collis A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P, Kolker E, Maxon M, Millard S, Mugabushaka AM, Perrin N, Remacle JE, Remington K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J: Megascience. 'Omics data sharing. Science 326(5950):234-236 (2009)

4. Harland L, Larminie C, Sansone SA, Popa S, Marshall MS, Braxenthaler M, Cantor M, Filsell W, Forster MJ, Huang E, Matern A, Musen M, Saric J, Slater T, Wilson J, Lynch N, Wise J, Dix I: Empowering industrial research with shared biomedical vocabularies. Drug Discov Today 16(21-22):940-947 (2011)

5. Sansone SA and Rocca-Serra P: On the evolving portfolio of community-standards and data sharing policies: turning challenges into new opportunities. GigaScience 1:10 (2012)