the data explosion · •not possible to predict how a crystal structure might be used in the...
TRANSCRIPT
The data explosion
…and the need to manage diverse data
sources in scientific research
Simon Coles ([email protected])
Director, UK National Crystallography Service
Why manage?
• Volume
– Day to day coping
– How to “publish”??
• Scientific Responsibility
– Accurate recording of the whole experiment
– Archival & curation
– Enabling future science
2
• Diversity
– Heterogeneous data
– Disciplinary differences
– Institutional boundaries
• Accountability
– Auditing and charging
– Funder mandates
What can structures be used for?
• For example…
– Materials science – link to properties, starting point for calculations
– Insight into / control of crystallisation & crystallisability
– Starting point for computational chemistry
• Data management is key, because its all about context and linking…
• Good crystallographic practice requires information about the experiment not covered wholly by CIF
• Not possible to predict how a crystal structure might be used in the future – measure everything you possibly can! 3
Increasing complexity
• Even for pure crystallographic studies volumes of data are getting bigger
• ‘Informatics’ approaches being adopted in order to develop fundamental understanding, rules etc
• Must be able to reliably evaluate quality (provenance) to incorporate results of others
• of others
4
The whole research process
5
Reference
Linking
Research Outputs
User registration
data; Instrument
allocation data
etc.
Comments,
annotations,
ratings etc.
Risk
assessment
data; other
sample data Analyse Derived
Data
Research Concept and/or
Experiment Design
Acquire Sample
Peer-review Proposal
Conduct Experiment Generate, Create,
& Collect Raw Data
Process Raw Data into
Derived Data
Interpret & Analyse
Results Data
Archive, Preservation & Curation
IPR, Embargo & Access Control
Validate, Reuse & Repurpose Data
Publish Research
Results Data Derived Data Processed Data Raw, Correction
& Calibration
Data
Papers, articles,
presentations, reports
An Idealised Scientific Research Data Lifecycle Model
Documentation, Metadata & Storage (Reference, Provenance, Context, Calibration etc.)
Start Project
Write Proposal
(include DMP)
Scholarly Knowledge
Write Usage Reports
Publication Database
Research Activity Research Admin
Activity
Archive Activity Information Flow KEY
Prepare Supplementary
Data
Prepare Manuscript
Peer Review Research Discover & Access
Appraisal & Quality Control
Programs (generate customised software)
Publication
Activity
How many standards?!
6
Referenc
e Linking
Research
Outputs
User
registration
data;
Instrument
allocation data
etc.
Comment
s,
annotation
s, ratings
etc.
Risk
assessme
nt data;
other
sample
data
Analyse
Derived
Data
Research
Concept
and/or
Experiment
Design
Acquire Sample
Peer-review
Proposal
Conduct Experiment
Generate, Create,
& Collect Raw Data
Process Raw
Data into
Derived Data
Interpret &
Analyse
Results
Data
Archive, Preservation & Curation
IPR, Embargo & Access Control
Validate, Reuse
& Repurpose Data
Publish
Researc
h
Results Data Derived Data Processed Data Raw, Correction
& Calibration
Data
Papers, articles,
presentations,
reports
Documentation, Metadata & Storage
(Reference, Provenance, Context, Calibration etc.)
Start Project
Write Proposal
(include DMP)
Scholarly
Knowledge
Write
Usage
Reports
Publicatio
n
Database
Research
Activity
Research
Admin Activity
Archive Activity Information Flow
KEY
Prepare
Supplementa
ry Data
Prepare
Manuscript
Peer Review
Research Discover & Access
Appraisal & Quality Control
Programs (generate customised software)
Publication
Activity
CCDC; PDB; IUCrXML; CIF
CML; PDBML; TopCAT; TARDIS;
eCrystals; DCC
P&E; DCC; NSF; ICAT;
NCS
P&E; CIF; imgCIF; ICAT; NeXus
Data management (eco)systems
• Publication Repository [Results Data]
• LIMS [User & Sample Data]
• ELN [Supporting Experiment Data]
• Datastores [Underlying Diffraction Data]
• Desktop/Laptop [Everything!]
• iPad/iPhone [Anything!]
• How to integrate these??7
Structured Data. i.e. using LIMS, Datastores & Repositories
LIMS: Lifecycle Management
9
Approval
Submission
ExperimentAnalysis
Publication
Proposal
Admin
Reporting
LIMS: Lifecycle Management
10
Approval
Submission
ExperimentAnalysis
Publication
Proposal
Admin
Reporting
LIMS: Lifecycle Management
11
Approval
Submission
ExperimentAnalysis
Publication
Proposal
Admin
Reporting
LIMS: Lifecycle Management
12
Approval
Submission
ExperimentAnalysis
Publication
Proposal
Admin
Reporting
LIMS: Lifecycle Management
13
Approval
Submission
ExperimentAnalysis
Publication
Proposal
Admin
Reporting
http://ecrystals.chem.soton.ac.uk
Management Across Boundaries
• Management across facilities (ICAT Information Model)
14
Conceive Research Propose Experiment Analyse Publish
Approval
Submission
ExperimentAnalysis
Publication
Proposal
Proposal Approval Schedule Experiment Archive Analyse Publish
NCS User
NCS
Central Facility
Datastores, ICAT & CSMD
The Core Scientific Metadata model forms the information model for ICAT & is designed to describe facilities-based experiments
15
RDBMS
Web Services API
ICAT API
Command Line Tools
Glassfish / JBOSS
JavaC++Fortran
Data Storage/ Delivery System
Single Sign On
User Database System
Proposal System
Publication System
e-Science Services
Software Repository
CSMD as a Starting Point
• Forms the basis for extensions:
- To derived data
- To laboratory based science
- To secondary analysis data
- To preservation information
- To publication data
16
Investigation
Publication KeywordTopic
SampleSample
ParameterDataset
Dataset Parameter
Datafile
DatafileParameter
Investigator
Related Datafile
Parameter
Authorisation
Getting Mobile in the Lab
• But every crystallographic experiment has ‘unstructured’ data
17
Getting Mobile in the Lab
18
Package & Sample Tube
Bulk Sample & Manipulation
Mounted Sample
Trial Diffraction Pattern
Getting Mobile in the Lab
The completed record…
19
Unstructured data.
Invariably the context for a crystal structure
Or
The context a crystal structure has to fit into!
Laboratory Notebooks
• All this is not a new problem…
21
23
LabTrove
Imposing structure – Planning & Enactment Ontology
• Plan (Prospective provenance)
24
• Enactment (Retrospective provenance)
• Realisation
oreChem Plan for eCrystals
• Machine-readable representation of methodology
• Describes requirements for software and data products
25
Bringing it all together.
Structures in context of compound, property, etc data
27RDF (ChemAxiom)
Structures in the context of ‘traditional’ publishing
• ELN providing Electronic Supplementary Information for conventional publication (Chemistry Central: accepted)
28
DOI’s for Data
• DataCite Institutional registration of DOI’s
• Promotes rigorous ‘self publishing’ of data
• Soton exemplars
• eCrystals(Repository/structured data)
29
DOI’s for Data
• LabTrove (unstructured data)
• Locked down HTML export of selected records
30
SIMS: Sample Information Management System
• A standard/format for crystallographic sample and experiment data management and archival
• Supported by CrystalClear and NCS Portal, providing interaction between facility, instruments and CIF, ImgCIF etc
31
elnItemManifest
• 3 layer metadata model for description, export & packaging
• This is the first (information) layer – leads into knowledge
• Published through Dial-a-Molecule athttp://wp.me/p2JoQ6-xF & submitted to J. ChemInf
32
Thanks
33