pulverer-embo-source data-nfdp13
DESCRIPTION
Presentation by Bernd Pulverer on EMBO's 'Source Data' and the next generation of open access given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK
TRANSCRIPT
EMBO SourceData – Next Gen Open Access
Bernd Pulverer
Chief Editor | The EMBO Journal
Head | Scientific Publications
Data transparency
Scientific publishing
– Dominant channel for the dissemination of peer-reviewed data.
– Journals function as a proxy for quality in research assessment.
– The rate of publishing keeps increasing.
– Papers are human-readable but poorly machine-readable.
Title
Abstract
Synopsis
Main paper
Supp Info
Datasets
The Research Paper
Title
Abstract
Synopsis
Main paper
Supp Info
Datasets
Expert View
The Research Paper
‘Expert View’
• All the data required to support the conclusions included in the paper.
• ‘General reader’ vs. ‘expert’ view of the paper:
– Expandable/collapsible ‘inline’ sections
– Copy edited
• Restricted to select types of data and information:
– Replicates
– Controls, experimental optimization
– ‘Negative’ results
– Extended experimental protocols
– Computational algorithms
• Datasets presented as separate files.
• No further reaching data
Title
Abstract
Synopsis
Main paper
Expert View
Datasets
Source data
What is a figure?
A scientific result converted into a collection of pixels
Discoverable, rich content
‘I’m a great believer in seeing all the data – this is a very important lever that we have for transparency’
Michael Farthing, founder COPE
SourceData
Tools to publish figures as structured digital objects that link the human-readable illustrations with machine-readable metadata and ‘source data’ in order to
• improve data transparency (ethics)
• make published data (re)useable
• enable data-oriented search
Metadata
• Focus on the biological content
• Use standard identifiers and existing controlled vocabularies
Search
• Data-oriented semantic search of the literature.
• Overcome some of the limitations of keyword-based search
SourceData
Data
• Figure source data files hosted by the journals
• Link to data repositories
•Archive
•Transparency
•Revisualization
•Reuse
•Integration
•Search
•Discourage manipulation
o voluntary
o ~40% papers
Data Transparency
Structured metadata: ‘perturbation-observation-assay’
1. ‘Object-oriented’ representation of experimental variables: list biological components.
2. Retain the causality of the experimental design: “Measurement of Y as a function of A, B, C, using assay P in biological system S.”
3. Machine-readable representation with standard identifiers.
measured component
perturbed component
experimental system
assayed property
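The ‘perturbation-observation-assay’ idea can be illustrated with a minimal Python sketch. This is not the SourceData schema: the class names, field names, and the `example:` identifiers are hypothetical placeholders chosen for illustration. It only shows how a list of components with identifiers can retain the causal template “Measurement of Y as a function of A, B, C, using assay P in biological system S.”

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    """A biological component with a standard identifier (placeholder values)."""
    name: str
    identifier: str  # hypothetical "namespace:accession" string


@dataclass(frozen=True)
class Experiment:
    """One experiment described by its variables rather than by free text."""
    measured: Component         # Y, the measured component
    perturbed: List[Component]  # A, B, C, the perturbed components
    assay: str                  # P, the assay
    system: str                 # S, the experimental system

    def describe(self) -> str:
        """Render the causal template from the slide as one sentence."""
        factors = ", ".join(c.name for c in self.perturbed)
        return (
            f"Measurement of {self.measured.name} as a function of {factors}, "
            f"using assay {self.assay} in biological system {self.system}."
        )


# Illustrative values only; the identifiers are invented for this sketch.
exp = Experiment(
    measured=Component("Y", "example:Y"),
    perturbed=[
        Component("A", "example:A"),
        Component("B", "example:B"),
        Component("C", "example:C"),
    ],
    assay="P",
    system="S",
)
print(exp.describe())
```

Keeping the measured and perturbed components in separate fields is what preserves the direction of the experiment; a flat keyword list would lose it.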
Data copy editors
Data-oriented search
Paper 1: drug Z, kinase Y activity
Paper 2: kinase Y, protein X (P)
Paper 3: gene x, tissue T, disease D
Resulting hypothesis: test drug Z in disease D.
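How the three papers on this slide could be chained into the hypothesis can be sketched in Python. This is an illustrative assumption, not the SourceData search engine: entities that share a paper are treated as linked, and `gene x` is assumed to encode `protein X` so the two are merged. The function name `find_chain` and the `same_as` mapping are hypothetical.

```python
from collections import deque

# Entities annotated per paper, taken from the slide labels
# ("kinase Y activity" is shortened to "kinase Y").
papers = {
    "Paper 1": {"drug Z", "kinase Y"},
    "Paper 2": {"kinase Y", "protein X"},
    "Paper 3": {"gene x", "tissue T", "disease D"},
}

# Assumption: gene x encodes protein X, so both map to one entity.
same_as = {"gene x": "protein X"}


def normalize(entity):
    """Map an entity name to its canonical form."""
    return same_as.get(entity, entity)


def find_chain(papers, start, goal):
    """Breadth-first search over entities that co-occur in a paper.

    Returns a list of (entity, paper, next_entity) steps, or None.
    """
    start, goal = normalize(start), normalize(goal)
    adjacency = {}
    for title, entities in papers.items():
        names = sorted({normalize(e) for e in entities})
        for a in names:
            for b in names:
                if a != b:
                    adjacency.setdefault(a, []).append((b, title))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == goal:
            return path
        for nxt, title in adjacency.get(entity, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(entity, title, nxt)]))
    return None


chain = find_chain(papers, "drug Z", "disease D")
for entity, title, nxt in chain:
    print(f"{entity} -> {nxt} ({title})")
```

A keyword search for "drug Z" and "disease D" would return none of these papers, since no single paper mentions both; the link only appears once the annotated entities are joined across papers.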
Data-oriented search
Query: More-like-this:
CREB / forskolin
CREB / forskolin
CREB / forskolin
CREB / time
sdAnnotations:annotationID a sdCore:PerturbationMeasurmentExp ;
    :linkedToPanel sdPanels:panelID ;
    :hasVariable sdVariables:variable1 ;
    :hasVariable sdVariables:variable2 ;
    :usingBiologicalSystem sdBiolSystem:biolSystemNode ;
    :basedOnSourcedataset sdSourceDatasets:dsID .
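The annotation above uses RDF Turtle's predicate-object list syntax, where `;` repeats the subject and `.` ends the statement. A minimal Python sketch of how such a record might be generated follows. The helper `to_turtle` is hypothetical, it uses plain string formatting rather than an RDF library, and a complete Turtle document would also need `@prefix` declarations, which the slide does not show.

```python
def to_turtle(subject, rdf_type, properties):
    """Serialize one subject as a Turtle predicate-object list.

    `properties` is a list of (predicate, object) pairs written as
    prefixed names; no validation or escaping is attempted here.
    """
    lines = [f"{subject} a {rdf_type}"]
    lines += [f"    {pred} {obj}" for pred, obj in properties]
    return " ;\n".join(lines) + " ."


# Names copied from the slide; they are placeholders, not real records.
annotation = to_turtle(
    "sdAnnotations:annotationID",
    "sdCore:PerturbationMeasurmentExp",
    [
        (":linkedToPanel", "sdPanels:panelID"),
        (":hasVariable", "sdVariables:variable1"),
        (":hasVariable", "sdVariables:variable2"),
        (":usingBiologicalSystem", "sdBiolSystem:biolSystemNode"),
        (":basedOnSourcedataset", "sdSourceDatasets:dsID"),
    ],
)
print(annotation)
```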
‘Next Generation’ Open Access
Data
Search
Metadata
Raw, rare, well done...?
From raw to processed data
A data ‘ecosystem’
data access, search
Reader
paper, data
Author
SourceData
Journals
Data repositories
Distributed infrastructure
Database
Journals
Users
Research data
Figure labels: Smad3, Hey1, TGFbeta, VE-cdh, Rad51 foci, AR, Tsc2; Rad51, nuclear complexes; TGFb, Smad3
Literature search engines
PubMed 72%
Europe PMC <2%
Google 17%
Data are published in papers
‘Publishing’ papers
‘Depositing’ datasets
Availability of published data and software
• Datasets obtained by experimentation, computation or data mining, should be made freely available, without restriction.
• Software should be described in sufficient detail to allow reproduction. If a specific implementation is the focus of the study, free access for non-commercial users is strongly recommended.
• Deposition of data should preferably be in one of the public databases prior to submission.
Data deposition
Large-scale datasets, sequences, atomic coordinates and computational models should be deposited in one of the relevant public databases prior to submission (provided private access is available at the database) and authors should include accession codes in the Materials & Methods section.
BigData
Public databases
Structural data: PDB, NDB, EMDataBank
Functional genomics: GEO, ArrayExpress
Proteomics: Pride, PeptideAtlas, PASSEL
PPI: IMEx consortium
Clinical genomics datasets: EGA, dbGAP
Metagenomics: Genbank
Computational models: BioModels, JWS
search
SourceData
Data
• Figure source data files hosted by the journals
• Link to ‘unstructured data’ repositories
Metadata
• Focus on the biological content
• Use standard identifiers and existing controlled vocabularies
Search
• Data-oriented semantic search of the literature.
• Overcome some of the limitations of keyword-based search