pulverer-embo-source data-nfdp13

EMBO SourceData: Next Gen Open Access

Bernd Pulverer
Chief Editor | The EMBO Journal
Head | Scientific Publications

Data transparency

Scientific publishing
– Dominant channel for the dissemination of peer-reviewed data.
– Journals function as a proxy for quality in research assessment.
– The rate of publishing keeps increasing.
– Papers are human-readable but poorly machine-readable.


The Research Paper

Components: Title, Abstract, Synopsis, Main paper, Supp Info, Datasets

The Research Paper

Components: Title, Abstract, Synopsis, Main paper, Supp Info, Datasets, Expert View

‘Expert View’
• All the data required to support the conclusions included in the paper.
• ‘General reader’ vs. ‘expert’ view of the paper:
  – Expandable/collapsible ‘inline’ sections
  – Copy edited
• Restricted to select types of data and information:
  – Replicates
  – Controls, experimental optimization
  – ‘Negative’ results
  – Extended experimental protocols
  – Computational algorithms
• Datasets presented as separate files.
• No further reaching data


Components of the paper: Title, Abstract, Synopsis, Main paper, Expert View, Datasets, Source data

What is a figure?

A scientific result converted into a collection of pixels


Discoverable, rich content

‘I’m a great believer in seeing all the data – this is a very important lever that we have for transparency’

Michael Farthing, founder COPE

SourceData

Tools to publish figures as structured digital objects that link the human-readable illustrations with machine-readable metadata and ‘source data’ in order to:
• improve data transparency (ethics)
• make published data (re)useable
• enable data-oriented search


Metadata

• Focus on the biological content
• Use standard identifiers and existing controlled vocabularies

Search

• Data-oriented semantic search of the literature.
• Overcome some of the limitations of keyword-based search


SourceData

Data

• Figure source data files hosted by the journals
• Link to data repositories

• Archive
• Transparency
• Revisualization
• Reuse
• Integration
• Search
• Discourage manipulation

o voluntary
o ~40% papers


Data Transparency


Structured metadata: ‘perturbation-observation-assay’

1. ‘Object-oriented’ representation of experimental variables: list biological components.

2. Retain the causality of the experimental design: “Measurement of Y as a function of A, B, C, using assay P in biological system S.”

3. Machine-readable representation with standard identifiers.

Diagram labels: perturbed component, measured component, assayed property, experimental system
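The three points above can be sketched as a small data model. This is only an illustration, not the SourceData schema: the class names `Component` and `Experiment`, their fields, and the `id:` placeholder strings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Component:
    """A biological component tagged with an identifier (hypothetical class)."""
    name: str
    identifier: str  # placeholder strings below, not real accession codes


@dataclass(frozen=True)
class Experiment:
    """One annotated panel: measurement of Y as a function of A, B, C."""
    perturbed: Tuple[Component, ...]  # A, B, C
    measured: Component               # Y
    assay: str                        # P
    system: str                       # S

    def describe(self) -> str:
        """Rebuild the human-readable sentence from the structured fields."""
        variables = ", ".join(c.name for c in self.perturbed)
        return (f"Measurement of {self.measured.name} as a function of "
                f"{variables}, using assay {self.assay} "
                f"in biological system {self.system}.")


panel = Experiment(
    perturbed=(Component("A", "id:A"), Component("B", "id:B"), Component("C", "id:C")),
    measured=Component("Y", "id:Y"),
    assay="P",
    system="S",
)
print(panel.describe())
```

Keeping the perturbed and measured components in separate fields is what retains the causality of the experimental design; a flat keyword list would lose it.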

Data copy editors



Data-oriented search

Paper 1: drug Z, kinase Y activity
Paper 2: kinase Y, protein X, P
Paper 3: gene x, tissue T, disease D

Resulting hypothesis: test drug Z in disease D.
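One way to picture how linked annotations from separate papers could lead to such a hypothesis is a path search over perturbation-to-measurement links. The sketch below is hypothetical: the direction of each link, the treatment of protein X as the product of gene x, and the function name `find_chain` are assumptions, not details given on the slide.

```python
from collections import deque

# Each entry: (perturbed component, measured component, source paper).
# Link directions and the gene x / protein X bridge are assumptions.
LINKS = [
    ("drug Z", "kinase Y", "Paper 1"),
    ("kinase Y", "protein X", "Paper 2"),
    ("protein X", "disease D", "Paper 3"),
]


def find_chain(links, start, goal):
    """Breadth-first search for a chain of links from start to goal."""
    graph = {}
    for src, dst, paper in links:
        graph.setdefault(src, []).append((dst, paper))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, paper in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, nxt, paper)]))
    return None  # no chain connects start to goal


chain = find_chain(LINKS, "drug Z", "disease D")
for src, dst, paper in chain:
    print(f"{paper}: {src} -> {dst}")
```

A keyword search would return the three papers separately; the chain only appears once each result is stored as a structured link between components.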




Query and ‘More-like-this’ results (figure): forskolin, CREB; time, CREB

sdAnnotations:annotationID a sdCore:PerturbationMeasurmentExp;
    :linkedToPanel sdPanels:panelID;
    :hasVariable sdVariables:variable1;
    :hasVariable sdVariables:variable2;
    :usingBiologicalSystem sdBiolSystem:biolSystemNode;
    :basedOnSourcedataset sdSourceDatasets:dsID .
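As a rough illustration of how such an annotation might be generated from structured fields, the sketch below assembles the same statement as text. The function `to_turtle` and its parameters are hypothetical; the prefixes and predicates are copied from the annotation above, and because no prefix declarations are given there, the output is a fragment rather than a complete RDF document.

```python
def to_turtle(annotation_id, panel_id, variable_ids, system_id, dataset_id):
    """Render one annotation with the prefixes and predicates shown above."""
    lines = [
        f"sdAnnotations:{annotation_id} a sdCore:PerturbationMeasurmentExp;",
        f"    :linkedToPanel sdPanels:{panel_id};",
    ]
    # One :hasVariable statement per experimental variable.
    lines += [f"    :hasVariable sdVariables:{v};" for v in variable_ids]
    lines.append(f"    :usingBiologicalSystem sdBiolSystem:{system_id};")
    lines.append(f"    :basedOnSourcedataset sdSourceDatasets:{dataset_id} .")
    return "\n".join(lines)


text = to_turtle(
    "annotationID", "panelID", ["variable1", "variable2"],
    "biolSystemNode", "dsID",
)
print(text)
```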

‘Next Generation’ Open Access

Data, Search, Metadata

Raw, rare, well done...?

From raw to processed data

A data ‘ecosystem’

Diagram labels: Author, paper, data, Journals, Data repositories, SourceData, Reader, data access, search


Distributed infrastructure

Diagram labels: Database, Journals, Users, Research data

Figure labels: Smad3, Hey1, TGFbeta, VE-cdh, Rad51 foci, AR, Tsc2; Rad51; Nuclear complexes; TGFb, Smad3

Literature search engines

PubMed: 72%
Europe PMC: <2%
Google: 17%

Data are published in papers


‘Publishing’ papers

‘Depositing’ datasets

Availability of published data and software

• Datasets obtained by experimentation, computation or data mining, should be made freely available, without restriction.

• Software should be described in sufficient detail to allow reproduction. If a specific implementation is the focus of the study, free access for non-commercial users is strongly recommended.

• Deposition of data should preferably be in one of the public databases prior to submission.

Data deposition

Large-scale datasets, sequences, atomic coordinates and computational models should be deposited in one of the relevant public databases prior to submission (provided private access is available at the database) and authors should include accession codes in the Materials & Methods section.

BigData

Public databases

Structural data: PDB, NDB, EMDataBank
Functional genomics: GEO, ArrayExpress
Proteomics: PRIDE, PeptideAtlas, PASSEL
PPI: IMEx consortium
Clinical genomics datasets: EGA, dbGaP
Metagenomics: GenBank
Computational models: BioModels, JWS


SourceData

Data

• Figure source data files hosted by the journals
• Link to ‘unstructured data’ repositories

