PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Page 1: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

PaN-data WP7 - Integration

Brian Matthews STFC-e-Science

Page 2: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Integration Workpackage
• Last work-package to start
  – M8 (January)
  – Goes on to the end of project
• Dependencies on outcomes of other WPs
  – Users, Data, Software
• Deliverables
  – D7.1: Report on survey of publication repositories, cross-linking and long-term preservation (M12).
  – D7.2: Proposal for integration of practices (M16).
  – D7.3: Final report on standards for publication repositories, cross-linking and long-term preservation (M18)
• STFC (4 SM), DLS (2 SM), others @ 0.5 SM
• Still early, so for now only general ideas on the work in the area.
  – Get the right people together in advance
  – Quite an open-ended work-package
  – Start thinking

Page 3: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

WP7 Development of standards for integration and cross-linking of outputs

Objectives
• To foster the integration of the whole science lifecycle, focussing on linking of publications and data, interaction between institutional repositories of publications, packaging for long-term preservation, and services for search and reuse.

Methodology:
• Publications repositories complete the lifecycle of innovation. Linking to Users, Data and Software enables traceability of published results through the scientific process. Sharing of the final results provides a foundation for the next cycle of science, and packaging enables long-term preservation of the outputs of research. Association of data with the publications resulting from it is a basis for preservation through Representation Information, a term from the OAIS standard (Open Archival Information System) meaning the information necessary to ensure continued understandability and usability of a digital resource.
• Furthermore, this is also a basis for reuse of data across diverse communities, since the supplementary information needed for continued understandability is also valuable for transfer across communities. The European Support Action PARSE.Insight (of which STFC is WPL) is producing a roadmap for digital preservation in Europe, informed by a large-scale survey of attitudes and practices in a wide range of scientific disciplines. The roadmap includes components such as tools for creation of Representation Information, and will be taken into account in the project work.

Task 7.1: Review existing provision for publication repositories, citation recording and long-term preservation in use across the facilities and in the user community, including facility libraries. (M8-M12)
Task 7.2: Propose strategy on integration of practices across the community (M12-M16).
Task 7.3: Develop final proposal on integration of practices across the community (M17-18). (Note: the final workshop to disseminate the results of the work package takes place in WP3)

Deliverables
D7.1: Report on survey of publication repositories, cross-linking and long-term preservation (M12).
D7.2: Proposal for integration of practices (M16).
D7.3: Final report on standards for publication repositories, cross-linking and long-term preservation (M18)

Page 4: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Objective 7 – Integration and cross-linking of outputs

• To foster the integration of the whole science lifecycle, focussing on linking of publications and data, interaction between institutional repositories of publications, packaging for long-term preservation, and services for search and reuse.

Page 5: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Desired Information Flow

[Figure: I2S2: An Idealised Scientific Research Activity Lifecycle Model. The diagram traces information flow from research concept, proposal and experiment, through raw, processed, derived and results data, to publication, archiving, preservation and reuse. Key: Research Activity, Research Admin Activity, Publication Activity, Archive Activity, Information Flow.]

Integration and linking via
– Common information exchange model
– Common tools, services and protocols

Page 6: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Facilities Lifecycle

[Figure: stages of the facility lifecycle]
• Proposal: scientist submits application for beamtime
• Approval: facility committee approves application
• Scheduling: facility registers, trains, and schedules scientist’s visit
• Experiment: scientist visits, facility runs experiment
• Data storage: raw data filtered, cleansed and stored
• Data analysis: tools for processing made available
• Record Publication: subsequent publication registered with facility

Link

Why Link?
– Discovery of results
– Auditing of usage of facility
– Allowing greater reuse of data
– Validation of results

Page 7: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

[Figure: Facility 1, Facility 2 and Facility 3 each hold their own Raw Data, Data Analysis, Analysed Data, Publication Data and Publications. Alongside them the diagram shows Capacity Storage, Data Repositories, Software Repositories and Publications Repositories, together with a Raw Data Catalogue, Data Analysis, an Analysed Data Catalogue, a Publication Data Catalogue and a Publications Catalogue.]

Single Infrastructure, Single User Experience

Page 8: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Objective 7 – Integration and cross-linking of outputs

• To foster the integration of the whole science lifecycle, focussing on linking of publications and data, interaction between institutional repositories of publications, packaging for long-term preservation, and services for search and reuse.

Outcomes
1. promote the linking of publications, ... to the data on which they are based,
2. foster the development of interaction between repositories of publications, ...
3. work towards packaging the full scientific results of particular experiments for archival purposes, ... aimed at the long-term preservation of the data and other results,
4. define search services ... which will enable single searches ..., and importantly will open up the possibility of reuse of data across different disciplines through the same mechanism of packaging for archival with the needed supplementary information for understanding and reuse.

Page 9: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Issues
• Existing repositories
• Data citation
• Constructing and maintaining links
  – Identifying users, data resources, software
  – Federating and accessing linked infrastructure
  – Linked Web of Data
• Digital preservation
• Packaging and access

Page 10: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Existing publication management systems

• What existing methods do facilities use to track publications arising from work at their facilities?
  – In house
  – Libraries
  – Public services
  – Entry points

Page 11: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Citation of data

– Persistent Identifiers (e.g. DOIs)
– Standard ways of citing data
– Who do you cite?
– What do you cite?
  • Raw data
  • Derived data
  • Data delivered to publishers
– Data policy
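As an illustration of what a "standard way of citing data" with a persistent identifier could look like, here is a minimal Python sketch. The record fields, the citation layout and the DOI are assumptions made for this example; the slides do not prescribe a format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal citation metadata for a dataset (hypothetical schema)."""
    creators: tuple[str, ...]
    year: int
    title: str
    publisher: str
    doi: str  # persistent identifier, without the resolver prefix


def format_data_citation(record: DatasetRecord) -> str:
    """Render 'Creators (Year): Title. Publisher. https://doi.org/<doi>'."""
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.year}): {record.title}. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


example = DatasetRecord(
    creators=("Example, A.", "Sample, B."),
    year=2010,
    title="Raw diffraction data for experiment 0001",
    publisher="Example Facility",
    doi="10.5072/example-0001",  # placeholder DOI, not a real dataset
)
print(format_data_citation(example))
```

Whether the cited object is the raw data, the derived data or the data delivered to a publisher only changes which record is passed in; the "who do you cite?" question maps onto the creators field.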

Page 12: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Linking publications and data

• Find datasets in repositories that were used to derive publications.
• Find papers that were written from datasets.
  – Can validate the results of the paper
  – Can perform new secondary analyses
  – Can judge the value of a data set from its use
  – Can give credit to data providers, tracing usage
  – Can also add forward links to papers, to evaluate their use.
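The two lookups above (paper to datasets, dataset to papers) can be sketched as a pair of indexes built from the same list of links. This is a minimal Python illustration; the identifiers and the function names are hypothetical and not taken from any facility system.

```python
from collections import defaultdict

# Hypothetical (paper_id, dataset_id) links, e.g. harvested from repositories.
links = [
    ("paper:A", "data:raw-1"),
    ("paper:A", "data:derived-1"),
    ("paper:B", "data:raw-1"),
]


def build_indexes(pairs):
    """Build lookups in both directions from one list of links."""
    datasets_for_paper = defaultdict(set)
    papers_for_dataset = defaultdict(set)
    for paper, dataset in pairs:
        datasets_for_paper[paper].add(dataset)
        papers_for_dataset[dataset].add(paper)
    return datasets_for_paper, papers_for_dataset


datasets_for_paper, papers_for_dataset = build_indexes(links)


def usage_count(dataset_id):
    """Number of papers derived from a dataset: a crude measure of its use."""
    return len(papers_for_dataset.get(dataset_id, ()))


print(sorted(datasets_for_paper["paper:A"]))
print(usage_count("data:raw-1"))
```

The reverse index is what supports judging the value of a dataset from its use and giving credit to data providers.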

Page 13: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Constructing Links
• Ideally the archives holding the data would be notified that a paper citing them had been submitted.
  – Metadata associated with those records would be updated to reflect the citations.
  – The metadata in the publication repository should also link to the metadata in the data archives and vice versa.
  – It would be great if this notification could be done automatically.
• Tedious to enter citations
• “forward citations” (“cited-by”) are hard to track
• Builds a citation graph
  – Fits well with the notion of “Linked Web of Data”
  – Could easily be extended to other components
    • Derived data
    • Software
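The automatic notification described on this slide could work roughly as follows. This Python sketch uses two in-memory stores standing in for a publication repository and a data archive; the class, the function and the "cites"/"cited_by" field names are assumptions for illustration only.

```python
class Repository:
    """In-memory stand-in for a repository's metadata store (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.records = {}  # record id -> metadata dict

    def add(self, record_id, **metadata):
        self.records[record_id] = {"cites": set(), "cited_by": set(), **metadata}


def notify_submission(publications, archive, paper_id, cited_ids):
    """Record a submitted paper's data citations on both sides.

    Returns the cited ids that the archive does not hold, so they can be
    followed up by hand.
    """
    unknown = []
    for dataset_id in cited_ids:
        record = archive.records.get(dataset_id)
        if record is None:
            unknown.append(dataset_id)
            continue
        record["cited_by"].add(paper_id)  # forward ("cited-by") link
        publications.records[paper_id]["cites"].add(dataset_id)  # backward link
    return unknown


pubs = Repository("publications")
archive = Repository("data archive")
archive.add("data:raw-1", kind="raw")
archive.add("data:derived-1", kind="derived")
pubs.add("paper:A", title="Example paper")

missing = notify_submission(pubs, archive, "paper:A", ["data:raw-1", "data:missing"])
print(missing)
```

Because records are generic, derived data or software entries can be added to the archive in the same way, which is how the citation graph would extend to other components.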

Page 14: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Preservation

• Preservation policies and planning
  – What data to preserve, for how long?
• Procedures for managing preservation
  – Persistent Ids
  – Maintaining media
  – Maintaining Links
  – Maintaining context
• Representation information
• Packaging preserved data for access to users
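To make the packaging idea concrete, the Python sketch below bundles a data payload with its representation information, context and a fixity checksum. The structure borrows OAIS vocabulary loosely; the field names and the example values are assumptions, not a schema defined by the project.

```python
import hashlib
import json


def build_package(identifier, payload, representation_info, context):
    """Describe a payload (bytes) with the information needed to reuse it later."""
    return {
        "identifier": identifier,  # ideally a persistent id
        "content": {
            "size_bytes": len(payload),
            # What a future reader needs to interpret the bytes.
            "representation_information": representation_info,
        },
        "preservation_description": {
            "fixity": {
                "algorithm": "sha256",
                "digest": hashlib.sha256(payload).hexdigest(),
            },
            "context": context,  # e.g. links to proposal and publications
        },
    }


def verify_fixity(package, payload):
    """Check that the payload still matches the digest stored at packaging time."""
    fixity = package["preservation_description"]["fixity"]
    return hashlib.new(fixity["algorithm"], payload).hexdigest() == fixity["digest"]


payload = b"0.0,1.2\n0.1,1.4\n"
package = build_package(
    "data:raw-1",  # hypothetical identifier
    payload,
    {"format": "CSV", "columns": ["angle_deg", "intensity"]},
    {"publications": ["paper:A"]},
)
print(json.dumps(package, indent=2))
print(verify_fixity(package, payload))
```

Re-running the fixity check after media migration is one simple procedure for "maintaining media", while the context block carries the links that need maintaining.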

Page 15: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Access

• Cross-searching
  – Common metadata models
  – Common services
    • E.g. TopCat front end on ICat
  – Cross-searching
• Complex data objects
  – OAI-ORE
  – SPARQL end-points
• OAIS packages
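Cross-searching over a common metadata model can be illustrated by mapping each facility's native records into one shared shape and then searching the merged list. In this Python sketch the two facility formats, the field names and the mapping functions are all invented for the example; it does not use the ICat or TopCat APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonRecord:
    """Shared metadata model (hypothetical) that every facility maps into."""
    facility: str
    identifier: str
    title: str
    keywords: frozenset


def from_facility_a(raw):
    # Facility A (invented format): comma-separated tags in one string.
    tags = frozenset(t.strip().lower() for t in raw["tags"].split(","))
    return CommonRecord("A", raw["id"], raw["name"], tags)


def from_facility_b(raw):
    # Facility B (invented format): numeric uid and a list of keywords.
    tags = frozenset(k.lower() for k in raw["keywords"])
    return CommonRecord("B", f"B-{raw['uid']}", raw["title"], tags)


def cross_search(records, term):
    """Case-insensitive match on title text or on an exact keyword."""
    term = term.lower()
    return [r for r in records if term in r.title.lower() or term in r.keywords]


catalogue = [
    from_facility_a({"id": "A-17", "name": "Neutron scattering of ice", "tags": "neutron, ice"}),
    from_facility_b({"uid": 42, "title": "X-ray study of ice films", "keywords": ["X-ray", "Ice"]}),
]
print([r.identifier for r in cross_search(catalogue, "ice")])
```

The design choice here is to normalise at ingest time, so that a single search service only ever sees the common model.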

Page 16: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Tasks
Task 7.1: Review existing provision for publication repositories, citation recording and long-term preservation in use across the facilities and in the user community, including facility libraries. (M8-M12)
  – D7.1: Report on survey of publication repositories, cross-linking and long-term preservation (M12).
Task 7.2: Propose strategy on integration of practices across the community (M12-M16).
  – D7.2: Proposal for integration of practices (M16).
Task 7.3: Develop final proposal on integration of practices across the community (M17-18)
  – D7.3: Final report on standards for publication repositories, cross-linking and long-term preservation (M18)

Page 17: PaN-data WP7 - Integration Brian Matthews STFC-e-Science

Who should be involved?
• All partners involved
  – Representation from managers of records of publications (libraries)
• Set up a wiki group to start thinking of issues and approaches
• Evaluate user, data, software outputs for integration
• Collect information on suitable publication repositories
• Collect information on suitable initiatives and standards
  – Data integration and linking
  – Data preservation
  – Persistent identifiers
  – Data citation
• Begin to evaluate for best practice

Ready to participate with outlines at M9 workshops