Publication of facility investigations Brian Matthews Scientific Information Group Scientific Computing Department STFC Rutherford Appleton Laboratory [email protected]

Upload: gerard-fletcher

Post on 11-Jan-2016


TRANSCRIPT

Page 1: Publication of facility investigations Brian Matthews Scientific Information Group Scientific Computing Department STFC Rutherford Appleton Laboratory

Publication of facility investigations

 

Brian Matthews

Scientific Information Group

Scientific Computing Department

STFC Rutherford Appleton Laboratory

[email protected]

Page 2

Scientific Computing develops and operates computing infrastructure: HPC, PB datastore, software, data management…

Funds and operates large-scale science for the UK research base: physics, astronomy, chemistry, materials

ESO: Alma Array

STFC

Page 3

Major Science Facilities

Big Science
– Particle Physics - exploring the very small
– Space Science - exploring the very large

Small Science
– Understanding the world around us at a molecular level
– Lasers, Neutron & Light Source – ISIS & Diamond

Page 4

Facilities Support

Big Facilities for Small Science

Diamond

ISIS

CLF

Page 5

Science at STFC Facilities

[Diagram labels: beam, sample, detector, imaging; data; computing, analysis, modelling; knowledge]

Neutrons and photons Provide complementary views of matter:

Photons “see” electric charge – high atomic number nuclei

Neutrons “see” nucleons – especially hydrogen atoms

Page 6

The science we do - Structure of materials

Fitting experimental data to model

Bioactive glass for bone growth

Structure of cholesterol in crude oil

Hydrogen storage for zero emission vehicles

Magnetic moments in electronic storage

• ~30,000 user visitors each year in Europe:
– physics, chemistry, biology, medicine
– energy, environmental, materials, culture
– pharmaceuticals, petrochemicals, microelectronics

Longitudinal strain in aircraft wing

Diffraction pattern from sample

Visit facility on research campus

Place sample in beam

• Billions of € of investment
– c. £400M for DLS
– + running costs

• Over 5,000 high impact publications per year in Europe
– But so far no integrated data repositories
– Lacking sustainability & traceability

Page 7

• Similar architecture used for DLS

• Scaling is a constant concern

• Data rates keep increasing

• 70TB per month and rising

• Tailored ICAT

• Reengineered StorageD

Diamond e-Infrastructure

[Architecture diagram. Recoverable labels are listed below; the numbered step markers on the diagram are omitted.]

Components: DUO Desk Applications (DLS Proposal Entry, http://duo.diamond.ac.uk/propman), duodesk, ICAT, IDMAN, CDR, SSO-MyProxy, vintela, GDA, XML Ingest, StorageD, SRB Scripts (b1-storage1), Atlas Data Store, DATA PORTAL/ICAT API, Active Directory, KDC, CA, DArc, lustre, Local Beamline lustre Client, EDNA MX/DNA, ISPyB, iKitten Databases

Beamlines: I02, I03, I04, I06, I07, I11, I12, I15, I16, I18, I19, I20, I22, I24, B16, B18, B22, B23

Scheduled jobs:

• JOB: icat_dls_propagation
– ON: orisa.icatdls
– FREQUENCY: 1 hour
– DB LINK: duodesk.dl.ac.uk
– ACTION: Pull data from DuoDesk to ICAT

• JOB: icat_dls_propogation
– ON: orisa.icatdls
– FREQUENCY: 30 mins
– ACTION: Load lookup data into ICAT
– External lookup data: /home/oracle/external_tables/dls33

• JOB: SSO - SYNCRONISATION PROD
– ON: orisa.sso
– FREQUENCY: daily at 08:45
– DB LINK: cdr.esc.rl.ac.uk
– ACTION: Pull data from CDR to IDMAN

• JOB: cron script
– ON: sso-myproxy
– FREQUENCY: daily at 09:18
– ACTION: Pull data from IDMAN to gridmap file (mapping FedID to DN)

• JOB: icatdls33_propagation
– ON: orisa.icatdls33
– FREQUENCY: 30 mins
– ACTION: Push data to iKittens

Other labels: valid user check; Federal ID; FedID/Password; Check FedID/Password; Kerberos Authentication; Kerberos Token; Certificate; SQL; Scommands; SRB containers; Transfer data to tape; Drop file; MX: strategy for data collection; UNIX Group created for Visit/Users; File to linux administrator; Picture location; User; data

Page 8

Proposals

Once awarded beamtime at ISIS, an entry will be created in ICAT that describes your proposed experiment.

Experiment

Data collected from your experiment will be indexed by ICAT (with additional experimental conditions) and made available to your experimental team

Analysed Data

You will have the capability to upload any desired analysed data and associate it with your experiments.

Publication

Using ICAT you will also be able to associate publications to your experiment and even reference data from your publications.

B-lactoglobulin protein interfacial structure

Example ISIS Proposal

GEM – High intensity, high resolution neutron diffractometer

H2-(zeolite) vibrational frequencies vs polarising potential of cations

Central Facility

• Secure access to user’s data

• Flexible data searching

• Scalable and extensible architecture

• Integration with analysis tools

• Access to high-performance resources

• Linking to other scientific outputs

• Data policy aware

http://code.google.com/p/icatproject/

Page 9

[Diagram entities: Investigation, Publication, Keyword, Topic, Sample, Sample Parameter, Dataset, Dataset Parameter, Datafile, Datafile Parameter, Related Datafile, Parameter, Investigator, Authorisation]

Core Scientific Metadata Model (CSMD)

The Core Metadata model forms the information model for ICAT.

It is designed to describe facility-based experiments in Structural Science.
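The nesting of the CSMD entities can be pictured with a few plain data classes. This is only a minimal sketch: the field names, the example values and the `datafile_count` helper are assumptions made for illustration, not the ICAT schema or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Parameter:
    """A named value attached to a sample, dataset or datafile."""
    name: str
    value: str
    units: str = ""


@dataclass
class Datafile:
    name: str
    location: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Dataset:
    """A collection of data files that is part of an investigation."""
    name: str
    datafiles: List[Datafile] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Investigation:
    """Top-level entity: everything else hangs off the investigation."""
    title: str
    facility: str
    instrument: str
    investigators: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)

    def datafile_count(self) -> int:
        # Hypothetical helper: total number of files across all datasets.
        return sum(len(ds.datafiles) for ds in self.datasets)


# Illustrative values only.
inv = Investigation(title="Example proposal", facility="ISIS",
                    instrument="GEM", investigators=["A. Scientist"])
run = Dataset(name="run-001")
run.datafiles.append(Datafile(name="run-001.nxs", location="/archive/run-001.nxs"))
run.parameters.append(Parameter(name="temperature", value="300", units="K"))
inv.datasets.append(run)
print(inv.datafile_count())
```

Keeping the investigation as the root object mirrors the point of the model: datasets and files are reached through the experiment they belong to.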

Page 10

TopCat

Page 11

DOIs for Data Publication

Page 12

Is this enough?

• What we have so far is good for:
– us to manage data
– users to access their own data
– citation of raw data

• But
– Traceability and validation?
– Reuse of the data?

• Need to make context more explicit
– The dataset is the wrong subject of discourse to focus on

Page 13

Support the wider Facilities Lifecycle

• Proposal: scientist submits application for beamtime

• Approval: facility committee approves application

• Scheduling: facility registers, trains, and schedules scientist’s visit

• Experiment: scientist visits, facility runs experiment

• Data storage: raw data filtered and stored

• Data analysis: tools for processing made available

• Record Publication: subsequent publication registered with facility

As in PanData-ODI – D6.1 (which has much more detail)

Page 14

Publishing Investigations

• So what we want is a record of EXPERIMENTS, not data.

• Thus we want the record of the context
– The experimental intention and actors
– The instruments and configurations used
– The sample
– The environmental parameters and context
– The Raw Data

• Thus we want to publish a record of the whole INVESTIGATION
– Can get most of this from what we have

• The Investigation becomes a “first class” research object
– Published
– Identified and treated as a single entity
– Cited and credited
– Record of the output of the facility

• Analogous to a Journal Article
– Investigation as the unit of discourse for scientific facilities.

• But also an access point for validation and reuse
– Because we have a record of what actually happened.

Page 15

Our DataCite entries are in fact Investigations (red is for “data” notion, and green is for “investigation”)

Page 16

“DataCite abuse”

As we have seen, we use DataCite for Investigations, with Datasets only referred to from them.

Other data curators sometimes use DataCite for Publications (“documents”) that contain data: http://data.datacite.org/10.7480/OA

So “data” DOIs tend to resolve into either Investigations or Publications.

• Extend the Resource Type

• Also may not want to have a landing page for all DOIs

Page 17

Research Objects

• Represent the “investigation” as a Research Object
– Research Objects (ROs) are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations. Their goal is to create a class of artifacts that can encapsulate our digital knowledge and provide a mechanism for sharing and discovering assets of reusable research and scientific knowledge.

• www.researchobject.org and elsewhere (WorkFlow4Ever)

• Represent Investigation as a Research Object
– Build a graph structure for the links in the research object.
– Using an RDF representation, URIs
– Publish as a linked data object

Bechhofer et al. Why Linked Data is Not Enough for Scientists. Proceedings of the 10th IEEE e-Science Conference, Brisbane, Australia (2010). http://eprints.ecs.soton.ac.uk/21587/5/research-objects-final.pdf

Arif Shaon, Sarah Callaghan, Bryan Lawrence, Brian Matthews. Opening up Climate Research: a linked data approach to publishing data provenance 7th Int Digital Curation Conference (2011).

Page 18

RDF representation of CSMD model

<!-- csmd:Investigation -->
<owl:Class rdf:about="csmd:Investigation">
  <rdfs:label>Investigation</rdfs:label>
  <rdfs:comment>An investigation or experiment</rdfs:comment>
</owl:Class>

<!-- csmd:Facility -->
<owl:Class rdf:about="csmd:Facility">
  <rdfs:label>Facility</rdfs:label>
  <rdfs:comment>An experimental facility</rdfs:comment>
</owl:Class>

<!-- csmd:Dataset -->
<owl:Class rdf:about="csmd:Dataset">
  <rdfs:label>Dataset</rdfs:label>
  <rdfs:comment>A collection of data files and part of an investigation</rdfs:comment>
</owl:Class>

<!-- csmd:Datafile -->
<owl:Class rdf:about="csmd:Datafile">
  <rdfs:label>Datafile</rdfs:label>
  <rdfs:comment>A data file</rdfs:comment>
</owl:Class>

Page 19

After proposal: Initialise the Research Object

Investigation #n

DOI:STFC.xxx.n

:instrument

:investigator

:n a csmd:Investigation ;
    csmd:investigation_doi doi:stfc.xxx.n ;
    csmd:investigation_investigationUser :iu1 ;
    csmd:investigation_instrument :inst1 .

:iu1 a csmd:investigationUser ; csmd:investigationUser_user :u1 .

:u1 a csmd:User .

:inst1 a csmd:Instrument .
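The statements above could be generated when a proposal is approved. The following is a rough sketch in plain Python that holds triples as tuples of prefixed names; the function names and the `:inv1` naming scheme are assumptions for illustration, and the `xxx` part of the DOI is kept as the placeholder used on the slide.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def init_research_object(n: int, users: List[str], instrument: str) -> List[Triple]:
    """Seed a research object for investigation number n (hypothetical helper)."""
    inv = f":inv{n}"
    triples: List[Triple] = [
        (inv, "a", "csmd:Investigation"),
        (inv, "csmd:investigation_doi", f"doi:stfc.xxx.{n}"),
        (inv, "csmd:investigation_instrument", f":{instrument}"),
        (f":{instrument}", "a", "csmd:Instrument"),
    ]
    # One investigationUser node per investigator, as in the slide's :iu1.
    for i, user in enumerate(users, start=1):
        link = f":iu{i}"
        triples += [
            (inv, "csmd:investigation_investigationUser", link),
            (link, "a", "csmd:investigationUser"),
            (link, "csmd:investigationUser_user", f":{user}"),
            (f":{user}", "a", "csmd:User"),
        ]
    return triples


def to_statements(triples: List[Triple]) -> str:
    """Write one 'subject predicate object .' statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


ro = init_research_object(1, ["u1"], "inst1")
print(to_statements(ro))
```

Later lifecycle stages (datasets, publications, software) would then only append further triples to the same subject.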

Page 20

After the experiment

Experimental Data Metadata

Investigation #n

DOI:STFC.xxx.n

:dataset

:instrument

:investigator

• Own metadata format (CSMD)

• More or less what ICAT currently supports

• Adds extra details on parameters, datasets, formats etc.

:sample

Data Storage

Page 21

Linking Publication into Investigation

[Diagram labels: Investigation #n (DOI:STFC.xxx.n); :instrument; :sample; :investigator; :dataset; :publication (twice); cito:cites (twice); Raw Data Repository; Publication Repository; Publication Store]

Page 22

Linking the derived data into the Investigation

[Diagram labels: Investigation #n (DOI:STFC.xxx.n); :instrument; :sample; :investigator; :dataset; :relatedDataset; :publication (twice); Raw Data Repository; Derived Data Repository; Publication Repository]

• Note that derived data could be on a different site

Page 23

Linking the software into the Investigation

[Diagram labels: Investigation #n (DOI:STFC.xxx.n); :instrument; :sample; :investigator; :dataset; :relatedDataset; :publication (twice); cito:cites (twice); :application; :inputDataset; :outputDataset; SoftwarePackage 1; Software Repository]

• W3C Prov ontology

• Assume that the software is in a repository
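One way to record that an analysis package turned a raw dataset into a derived one is with W3C PROV-O terms. The sketch below is an assumption about how that could look, not the representation used on the slide: `prov:Activity`, `prov:SoftwareAgent`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith` and `prov:wasDerivedFrom` are PROV-O terms, while the function names and the `:run1`, `:raw1`, `:derived1` identifiers are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def link_analysis(investigation: str, software: str,
                  input_dataset: str, output_dataset: str,
                  run_id: int) -> List[Triple]:
    """Describe one analysis run and attach its results to the investigation."""
    activity = f":run{run_id}"
    return [
        (activity, "a", "prov:Activity"),
        (activity, "prov:used", input_dataset),
        (activity, "prov:wasAssociatedWith", software),
        (software, "a", "prov:SoftwareAgent"),
        (output_dataset, "prov:wasGeneratedBy", activity),
        (output_dataset, "prov:wasDerivedFrom", input_dataset),
        # Links from the investigation, using the labels shown on the slide.
        (investigation, ":relatedDataset", output_dataset),
        (investigation, ":application", software),
    ]


def sources_of(triples: List[Triple], dataset: str) -> List[str]:
    """Return the datasets that the given dataset was derived from."""
    return [o for s, p, o in triples
            if s == dataset and p == "prov:wasDerivedFrom"]


graph = link_analysis(":inv1", ":SoftwarePackage1", ":raw1", ":derived1", 1)
print(sources_of(graph, ":derived1"))
```

Because the software is identified by a reference rather than embedded, it can live in a separate repository, as the slide assumes.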

Page 24

Generate Landing page from RO
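As a rough illustration of the idea, a landing page can be rendered directly from the research object's links. The dictionary layout, the `landing_page` function and the example values are assumptions for this sketch; a real page would be generated from the RDF graph.

```python
from html import escape


def landing_page(ro: dict) -> str:
    """Render a minimal HTML fragment for one investigation (hypothetical layout)."""
    items = "\n".join(
        f"  <li>{escape(label)}: {escape(str(value))}</li>"
        for label, value in ro["links"]
    )
    return (
        f"<h1>{escape(ro['title'])}</h1>\n"
        f"<p>DOI: {escape(ro['doi'])}</p>\n"
        f"<ul>\n{items}\n</ul>"
    )


# Illustrative values only; 'xxx' is the placeholder used on the slides.
page = landing_page({
    "title": "H2 & zeolite vibrational frequencies",
    "doi": "STFC.xxx.1",
    "links": [("instrument", "GEM"), ("dataset", "run-001")],
})
print(page)
```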

Page 25

Setting the Boundary: It depends on your Point of View

Investigations

Extended Publication

E-Portfolio

Page 26

Setting a boundary : OAI-ORE
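In OAI-ORE terms, the boundary is the set of resources that an aggregation explicitly aggregates. A minimal sketch follows, assuming prefixed names held as tuples: `ore:ResourceMap`, `ore:Aggregation`, `ore:describes` and `ore:aggregates` are OAI-ORE vocabulary terms, while the function names, the `-rem` suffix and the member identifiers are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def build_aggregation(aggregation: str, members: List[str]) -> List[Triple]:
    """State which resources belong to the investigation's aggregation."""
    resource_map = f"{aggregation}-rem"  # assumed naming for the resource map
    triples: List[Triple] = [
        (resource_map, "a", "ore:ResourceMap"),
        (resource_map, "ore:describes", aggregation),
        (aggregation, "a", "ore:Aggregation"),
    ]
    triples += [(aggregation, "ore:aggregates", m) for m in members]
    return triples


def inside_boundary(triples: List[Triple], aggregation: str, resource: str) -> bool:
    """A resource is inside the boundary only if it is explicitly aggregated."""
    return (aggregation, "ore:aggregates", resource) in triples


rem = build_aggregation(":inv1", [":raw1", ":derived1", ":paper1"])
print(inside_boundary(rem, ":inv1", ":raw1"))
print(inside_boundary(rem, ":inv1", ":paper2"))
```

Different points of view (investigation, extended publication, e-portfolio) would then correspond to different member lists over the same underlying graph.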

Page 27

Preserving Investigations

• Now becomes preserving the research object.
– Preserving a linked data graph
– Persistency of identifiers
– Managing integrity of external artefacts
– Link checking
– Copying and mirroring – checking consistency

• Representation Information to give more context on the objects
– And on the aggregate as a whole

• PDI (Provenance, Integrity etc.) on the whole aggregate object
– As well as components
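Link checking and integrity checking of external artefacts can be combined in one pass over the research object's links. The sketch below is an assumption about how such a check might look: a dictionary stands in for the remote repositories so the example runs offline, SHA-256 is an arbitrary choice of fixity algorithm, and all names are hypothetical.

```python
import hashlib
from typing import Dict, List


def fixity(content: bytes) -> str:
    """Checksum used to detect changes to an artefact."""
    return hashlib.sha256(content).hexdigest()


def check_research_object(links: List[str],
                          store: Dict[str, bytes],
                          recorded: Dict[str, str]) -> Dict[str, str]:
    """Classify each linked artefact as 'ok', 'changed' or 'missing'.

    links    -- identifiers the research object points at
    store    -- stand-in for the repositories holding the artefacts
    recorded -- checksums captured when the object was published
    """
    report = {}
    for ident in links:
        if ident not in store:
            report[ident] = "missing"   # broken link
        elif fixity(store[ident]) != recorded.get(ident):
            report[ident] = "changed"   # also covers no recorded checksum
        else:
            report[ident] = "ok"
    return report


store = {":raw1": b"raw counts", ":paper1": b"revised text"}
recorded = {
    ":raw1": fixity(b"raw counts"),
    ":paper1": fixity(b"original text"),
    ":derived1": fixity(b"fit results"),
}
report = check_research_object([":raw1", ":paper1", ":derived1"], store, recorded)
print(report)
```

Running the same check against a mirror copy gives a simple consistency test between the copies.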

Page 28

Adding Preservation Information – Rep Info for various items

[Diagram labels: Investigation #n (DOI:STFC.xxx.n); :instrument; :sample; :investigator; :dataset; :relatedDataset; :publication (twice); :application]

Rep Info attached to the items:
– Instrument description (website)
– Raw data format description (e.g. NeXus)
– Parameter description (e.g. NXDL, Con Vocab)
– Software classification
– Software description
– Sample description
– Analysed data format description
– Publication format description

• Would probably be more

• Work into a RepInfo Repository

• Would also have a RepInfo Network

Page 29

Adding Preservation Information – Rep Info for the whole aggregate

[Diagram labels: Investigation #n (DOI:STFC.xxx.n); :instrument; :sample; :investigator; :dataset; :relatedDataset; :publication (twice); :application]

Rep Info attached to the aggregate:
– Software classification
– CSMD Vocabulary description

Page 30

Summary

• Investigation is the appropriate unit of discourse for facilities science
– Publishable, Citable, Reportable
– Can be used as a vehicle for validation and reuse

• Basic principles of building research objects for facilities science
– Follow research lifecycle
– Consider Investigation a RO “seed”
– Apply Linked Data principles
– Re-use existing vocabularies and ontologies
– Share ROs via recognizable data formats and APIs

• Applicable beyond Facilities
– Other analogous objects: “experiments”, “observations”, “studies”

• The subject of preservation
– How do we maintain the integrity of Investigation objects?

Page 31

Thank You

Questions?

[email protected]

www.e-science.stfc.ac.uk