the data explosion · •not possible to predict how a crystal structure might be used in the...

33
The data explosion …and the need to manage diverse data sources in scientific research Simon Coles ([email protected] ) Director, UK National Crystallography Service

Upload: others

Post on 22-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

The data explosion

…and the need to manage diverse data

sources in scientific research

Simon Coles ([email protected])

Director, UK National Crystallography Service

Page 2: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Why manage?

• Volume

– Day to day coping

– How to “publish”??

• Scientific Responsibility

– Accurate recording of the whole experiment

– Archival & curation

– Enabling future science

2

• Diversity

– Heterogeneous data

– Disciplinary differences

– Institutional boundaries

• Accountability

– Auditing and charging

– Funder mandates

Page 3: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

What can structures be used for?

• For example…

– Materials science – link to properties, starting point for calculations

– Insight into / control of crystallisation & crystallisability

– Starting point for computational chemistry

• Data management is key, because its all about context and linking…

• Good crystallographic practice requires information about the experiment not covered wholly by CIF

• Not possible to predict how a crystal structure might be used in the future – measure everything you possibly can! 3

Page 4: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Increasing complexity

• Even for pure crystallographic studies volumes of data are getting bigger

• ‘Informatics’ approaches being adopted in order to develop fundamental understanding, rules etc

• Must be able to reliably evaluate quality (provenance) to incorporate results of others

• of others

4

Page 5: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

The whole research process

5

Reference

Linking

Research Outputs

User registration

data; Instrument

allocation data

etc.

Comments,

annotations,

ratings etc.

Risk

assessment

data; other

sample data Analyse Derived

Data

Research Concept and/or

Experiment Design

Acquire Sample

Peer-review Proposal

Conduct Experiment Generate, Create,

& Collect Raw Data

Process Raw Data into

Derived Data

Interpret & Analyse

Results Data

Archive, Preservation & Curation

IPR, Embargo & Access Control

Validate, Reuse & Repurpose Data

Publish Research

Results Data Derived Data Processed Data Raw, Correction

& Calibration

Data

Papers, articles,

presentations, reports

An Idealised Scientific Research Data Lifecycle Model

Documentation, Metadata & Storage (Reference, Provenance, Context, Calibration etc.)

Start Project

Write Proposal

(include DMP)

Scholarly Knowledge

Write Usage Reports

Publication Database

Research Activity Research Admin

Activity

Archive Activity Information Flow KEY

Prepare Supplementary

Data

Prepare Manuscript

Peer Review Research Discover & Access

Appraisal & Quality Control

Programs (generate customised software)

Publication

Activity

Page 6: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

How many standards?!

6

Referenc

e Linking

Research

Outputs

User

registration

data;

Instrument

allocation data

etc.

Comment

s,

annotation

s, ratings

etc.

Risk

assessme

nt data;

other

sample

data

Analyse

Derived

Data

Research

Concept

and/or

Experiment

Design

Acquire Sample

Peer-review

Proposal

Conduct Experiment

Generate, Create,

& Collect Raw Data

Process Raw

Data into

Derived Data

Interpret &

Analyse

Results

Data

Archive, Preservation & Curation

IPR, Embargo & Access Control

Validate, Reuse

& Repurpose Data

Publish

Researc

h

Results Data Derived Data Processed Data Raw, Correction

& Calibration

Data

Papers, articles,

presentations,

reports

Documentation, Metadata & Storage

(Reference, Provenance, Context, Calibration etc.)

Start Project

Write Proposal

(include DMP)

Scholarly

Knowledge

Write

Usage

Reports

Publicatio

n

Database

Research

Activity

Research

Admin Activity

Archive Activity Information Flow

KEY

Prepare

Supplementa

ry Data

Prepare

Manuscript

Peer Review

Research Discover & Access

Appraisal & Quality Control

Programs (generate customised software)

Publication

Activity

CCDC; PDB; IUCrXML; CIF

CML; PDBML; TopCAT; TARDIS;

eCrystals; DCC

P&E; DCC; NSF; ICAT;

NCS

P&E; CIF; imgCIF; ICAT; NeXus

Page 7: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Data management (eco)systems

• Publication Repository [Results Data]

• LIMS [User & Sample Data]

• ELN [Supporting Experiment Data]

• Datastores [Underlying Diffraction Data]

• Desktop/Laptop [Everything!]

• iPad/iPhone [Anything!]

• How to integrate these??7

Page 8: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Structured Data. i.e. using LIMS, Datastores & Repositories

Page 9: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

LIMS: Lifecycle Management

9

Approval

Submission

ExperimentAnalysis

Publication

Proposal

Admin

Reporting

Page 10: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

LIMS: Lifecycle Management

10

Approval

Submission

ExperimentAnalysis

Publication

Proposal

Admin

Reporting

Page 11: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

LIMS: Lifecycle Management

11

Approval

Submission

ExperimentAnalysis

Publication

Proposal

Admin

Reporting

Page 12: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

LIMS: Lifecycle Management

12

Approval

Submission

ExperimentAnalysis

Publication

Proposal

Admin

Reporting

Page 13: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

LIMS: Lifecycle Management

13

Approval

Submission

ExperimentAnalysis

Publication

Proposal

Admin

Reporting

http://ecrystals.chem.soton.ac.uk

Page 14: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Management Across Boundaries

• Management across facilities (ICAT Information Model)

14

Conceive Research Propose Experiment Analyse Publish

Approval

Submission

ExperimentAnalysis

Publication

Proposal

Proposal Approval Schedule Experiment Archive Analyse Publish

NCS User

NCS

Central Facility

Page 15: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Datastores, ICAT & CSMD

The Core Scientific Metadata model forms the information model for ICAT & is designed to describe facilities-based experiments

15

RDBMS

Web Services API

ICAT API

Command Line Tools

Glassfish / JBOSS

JavaC++Fortran

Data Storage/ Delivery System

Single Sign On

User Database System

Proposal System

Publication System

e-Science Services

Software Repository

Page 16: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

CSMD as a Starting Point

• Forms the basis for extensions:

- To derived data

- To laboratory based science

- To secondary analysis data

- To preservation information

- To publication data

16

Investigation

Publication KeywordTopic

SampleSample

ParameterDataset

Dataset Parameter

Datafile

DatafileParameter

Investigator

Related Datafile

Parameter

Authorisation

Page 17: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Getting Mobile in the Lab

• But every crystallographic experiment has ‘unstructured’ data

17

Page 18: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Getting Mobile in the Lab

18

Package & Sample Tube

Bulk Sample & Manipulation

Mounted Sample

Trial Diffraction Pattern

Page 19: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Getting Mobile in the Lab

The completed record…

19

Page 20: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Unstructured data.

Invariably the context for a crystal structure

Or

The context a crystal structure has to fit into!

Page 21: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Laboratory Notebooks

• All this is not a new problem…

21

Page 22: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Electronic Lab Notebooks

http://www.labtrove.org/

Page 23: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

23

LabTrove

Page 24: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Imposing structure – Planning & Enactment Ontology

• Plan (Prospective provenance)

24

• Enactment (Retrospective provenance)

• Realisation

Page 25: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

oreChem Plan for eCrystals

• Machine-readable representation of methodology

• Describes requirements for software and data products

25

Page 26: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Bringing it all together.

Page 27: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Structures in context of compound, property, etc data

27RDF (ChemAxiom)

Page 28: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Structures in the context of ‘traditional’ publishing

• ELN providing Electronic Supplementary Information for conventional publication (Chemistry Central: accepted)

28

Page 29: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

DOI’s for Data

• DataCite Institutional registration of DOI’s

• Promotes rigorous ‘self publishing’ of data

• Soton exemplars

• eCrystals(Repository/structured data)

29

Page 30: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

DOI’s for Data

• LabTrove (unstructured data)

• Locked down HTML export of selected records

30

Page 31: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

SIMS: Sample Information Management System

• A standard/format for crystallographic sample and experiment data management and archival

• Supported by CrystalClear and NCS Portal, providing interaction between facility, instruments and CIF, ImgCIF etc

31

Page 32: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

elnItemManifest

• 3 layer metadata model for description, export & packaging

• This is the first (information) layer – leads into knowledge

• Published through Dial-a-Molecule athttp://wp.me/p2JoQ6-xF & submitted to J. ChemInf

32

Page 33: The data explosion · •Not possible to predict how a crystal structure might be used in the future –measure everything you possibly can! 3. ... i.e. using LIMS, Datastores & Repositories

Thanks

33