PaN-data WP7 - Integration
Brian Matthews STFC-e-Science
Integration Workpackage
• Last work-package to start
  – M8 (January)
  – Goes on to the end of project
• Dependencies on outcomes of other WPs
  – Users, Data, Software
• Deliverables
  – D7.1: Report on survey of publication repositories, cross-linking and long-term preservation (M12).
  – D7.2: Proposal for integration of practices (M16).
  – D7.3: Final report on standards for publication repositories, cross-linking and long-term preservation (M18)
• STFC (4 SM), DLS (2 SM), others @ 0.5 SM
• Early, so now general ideas on the work in the area.
  – Get the right people together in advance
  – Quite an open-ended work-package
  – Start thinking
WP7 Development of standards for integration and cross-linking of outputs
Objectives
• To foster the integration of the whole science lifecycle, focussing on linking of publications and data, interaction between institutional repositories of publications, packaging for long-term preservation, and services for search and reuse.
Methodology:
• Publications repositories complete the lifecycle of innovation. Linking to Users, Data and Software enables traceability of published results through the scientific process. Sharing of the final results provides a foundation for the next cycle of science, and packaging enables long-term preservation of the outputs of research. Association of data with the publications resulting from it is a basis for preservation through Representation Information, a term from the OAIS standard (Open Archival Information System) meaning the information necessary to ensure continued understandability and usability of a digital resource.
• Furthermore, this is also a basis for reuse of data across diverse communities, since the supplementary information needed for continued understandability is also valuable for transfer across communities. The European Support Action PARSE.Insight (of which STFC is WPL) is producing a roadmap for digital preservation in Europe, informed by a large-scale survey of attitudes and practices in a wide range of scientific disciplines. The roadmap includes components such as tools for creation of Representation Information, and will be taken into account in the project work.
Task 7.1: Review existing provision for publication repositories, citation recording and long-term preservation in use across the facilities and in the user community, including facility libraries. (M8-M12)
Task 7.2: Propose strategy on integration of practices across the community (M12-M16).
Task 7.3: Develop final proposal on integration of practices across the community (M17-M18). (Note: the final workshop to disseminate the results of the work package takes place in WP3)
Deliverables
D7.1: Report on survey of publication repositories, cross-linking and long-term preservation (M12).
D7.2: Proposal for integration of practices (M16).
D7.3: Final report on standards for publication repositories, cross-linking and long-term preservation (M18)
Objective 7 – Integration and cross-linking of outputs
• To foster the integration of the whole science lifecycle, focussing on linking of publications and data, interaction between institutional repositories of publications, packaging for long-term preservation, and services for search and reuse.
Desired Information Flow
[Figure: I2S2: An Idealised Scientific Research Activity Lifecycle Model. Key: Research Activity; Research Admin Activity; Publication Activity; Archive Activity; Information Flow.]
[Activities shown include: Research Concept and/or Experiment Design; Start Project; Write Proposal (include DMP); Peer-review Proposal; Acquire Sample; Conduct Experiment; Generate, Create & Collect Raw Data; Process Raw Data into Derived Data; Interpret & Analyse Results Data; Prepare Supplementary Data; Prepare Manuscript; Peer Review Research; Publish Research; Write Usage Reports; Discover & Access; Validate, Reuse & Repurpose Data; Appraisal & Quality Control; Archive, Preservation & Curation; IPR, Embargo & Access Control; Documentation, Metadata & Storage (Reference, Provenance, Context, Calibration etc.).]
[Information shown includes: User registration data; Instrument allocation data; Risk assessment data; other sample data; Raw, Correction & Calibration Data; Processed Data; Derived Data; Results Data; Programs (generate customised software); Papers, articles, presentations, reports; Comments, annotations, ratings; Reference Linking; Research Outputs; Scholarly Knowledge; Publication Database.]
Integration and linking via
– Common information exchange model
– Common tools, services and protocols
Facilities Lifecycle
• Proposal: Scientist submits application for beamtime
• Approval: Facility committee approves application
• Scheduling: Facility registers, trains, and schedules scientist’s visit
• Experiment: Scientist visits, facility runs experiment
• Data storage: Raw data filtered, cleansed and stored
• Data analysis: Tools for processing made available
• Record Publication: Subsequent publication registered with facility
Link
Why Link?
– Discovery of results
– Auditing of usage of facility
– Allowing greater reuse of data
– Validation of results
[Figure: Facility 1, Facility 2 and Facility 3 each shown with Raw Data, Data Analysis, Analysed Data, Publication Data and Publications.]
[Shared layer: Capacity Storage; Publications Repositories; Data Repositories; Software Repositories.]
[Catalogue layer: Raw Data Catalogue; Data Analysis; Analysed Data Catalogue; Publication Data Catalogue; Publications Catalogue.]
Single Infrastructure, Single User Experience
Objective 7 – Integration and cross-linking of outputs
• To foster the integration of the whole science lifecycle, focussing on linking of publications and data, interaction between institutional repositories of publications, packaging for long-term preservation, and services for search and reuse.
Outcomes
1. promote the linking of publications, ... to the data on which they are based,
2. foster the development of interaction between repositories of publications, ...
3. work towards packaging the full scientific results of particular experiments for archival purposes, ... aimed at the long-term preservation of the data and other results,
4. define search services ... which will enable single searches ..., and importantly will open up the possibility of reuse of data across different disciplines through the same mechanism of packaging for archival with the needed supplementary information for understanding and reuse.
Issues
• Existing repositories
• Data citation
• Constructing and maintaining links
  – Identifying users, data resources, software
  – Federating and accessing linked infrastructure
  – Linked Web of Data
• Digital preservation
• Packaging and access
Existing publication management systems
• What existing methods do facilities use to track publications arising from work at their facilities?
  – In house
  – Libraries
  – Public services
  – Entry points
Citation of data
– Persistent Identifiers (e.g. DOIs)
– Standard ways of citing data
– Who do you cite?
– What do you cite?
  • Raw data
  • Derived data
  • Data delivered to publishers
– Data policy
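One way to picture a "standard way of citing data" is as a small record that answers the questions above (who, what, which identifier) and renders them consistently. The sketch below is only an illustration: the field names, the citation layout, the facility name and the DOI `10.1234/example` are all hypothetical, not a format proposed by the work package. The only external fact assumed is that a DOI resolves through `https://doi.org/`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical minimal record for citing a dataset."""
    creators: tuple   # who is cited
    year: int
    title: str
    publisher: str    # e.g. the facility holding the data
    doi: str          # persistent identifier
    level: str        # what is cited: "raw", "derived" or "published"

    def format(self) -> str:
        # Render one consistent citation string, ending in a resolvable DOI link.
        names = "; ".join(self.creators)
        return (
            f"{names} ({self.year}): {self.title} [{self.level} data]. "
            f"{self.publisher}. https://doi.org/{self.doi}"
        )


# Invented example values, for illustration only.
citation = DataCitation(
    creators=("Example, A.", "Sample, B."),
    year=2010,
    title="Hypothetical beamline dataset",
    publisher="Example Facility",
    doi="10.1234/example",
    level="raw",
)
print(citation.format())
```

Keeping the level (raw, derived, published) as an explicit field makes the "what do you cite" decision visible in every citation rather than leaving it implicit.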
Linking publications and data
• Find datasets in repositories which are used to derive publications.
• Find papers which are written from datasets.
  – Can validate the results of the paper
  – Can perform new secondary analyses
  – Can judge the value of a data set from its use
  – Can give credit to data providers, tracing usage
  – Can also add forward links to papers, to evaluate their use.
Constructing Links
• Ideally the archives holding the data would be notified that a paper citing them had been submitted.
  – Metadata associated with those records would be updated to reflect the citations.
  – The metadata in the publication repository should also link to the metadata in the data archives and vice versa.
  – It would be great if this notification could be done automatically.
• Tedious to enter citations
• “forward citations” (“cited-by”) are hard to track
• Builds a citation graph
  – Fits well with the notion of “Linked Web of Data”
  – Could easily be extended to other components
    • Derived data
    • Software
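The notification idea above can be sketched as a small in-memory citation graph in which a single notification updates both directions of a link, so that "cited-by" lookups do not have to be entered separately. This is a minimal illustration under stated assumptions: the class name, method names and identifiers such as `paper:P1` are invented here, and a real deployment would persist the links in the repositories' own metadata.

```python
from collections import defaultdict


class CitationGraph:
    """Hypothetical bidirectional link store for papers, data and software."""

    def __init__(self):
        self.cites = defaultdict(set)     # source -> items it cites
        self.cited_by = defaultdict(set)  # target -> items citing it

    def notify(self, source_id, target_id):
        """Record that source cites target, updating both sides at once."""
        self.cites[source_id].add(target_id)
        self.cited_by[target_id].add(source_id)

    def dependents(self, target_id):
        """Return everything that cites target, directly or indirectly."""
        seen = set()
        stack = [target_id]
        while stack:
            node = stack.pop()
            for citing in self.cited_by.get(node, ()):
                if citing not in seen:
                    seen.add(citing)
                    stack.append(citing)
        return seen


graph = CitationGraph()
graph.notify("paper:P1", "derived:D1")    # paper built on derived data
graph.notify("derived:D1", "raw:R1")      # derived data built on raw data
graph.notify("derived:D1", "software:S1") # and produced with some software

print(sorted(graph.dependents("raw:R1")))
```

Because derived data and software are just further nodes, the same structure covers the extension to other components without a separate mechanism.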
Preservation
• Preservation policies and planning
  – What data to preserve, for how long?
• Procedures for managing preservation
  – Persistent IDs
  – Maintaining media
  – Maintaining links
  – Maintaining context
• Representation information
• Packaging preserved data for access to users
Access
• Cross-searching
  – Common metadata models
  – Common services
    • E.g. TopCat front end on ICat
  – Cross-searching
• Complex data objects
  – OAI-ORE
  – SPARQL end-points
• OAIS packages
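Cross-searching depends on every facility describing its holdings in the same metadata model, so that one query can run over all catalogues. The sketch below shows that idea only: the `Record` fields, the `cross_search` function and the sample entries are assumptions made for illustration, and it does not use or imitate the ICat or TopCat interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical common metadata record shared by all facilities."""
    facility: str
    identifier: str
    kind: str    # e.g. "raw", "analysed", "publication"
    title: str


def cross_search(catalogues, term, kind=None):
    """Search every facility catalogue for a term, optionally by record kind."""
    term = term.lower()
    hits = []
    for records in catalogues.values():
        for record in records:
            if term in record.title.lower() and (kind is None or record.kind == kind):
                hits.append(record)
    # Sort so results are stable regardless of catalogue order.
    return sorted(hits, key=lambda r: (r.facility, r.identifier))


# Invented sample catalogues, keyed by facility name.
catalogues = {
    "Facility 1": [
        Record("Facility 1", "F1-001", "raw", "Protein crystal diffraction"),
        Record("Facility 1", "F1-002", "publication", "Protein structure paper"),
    ],
    "Facility 2": [
        Record("Facility 2", "F2-001", "raw", "Neutron scattering of protein gel"),
    ],
}

for hit in cross_search(catalogues, "protein", kind="raw"):
    print(hit.facility, hit.identifier, hit.title)
```

The single record type is what makes the single search possible; each facility can keep its own storage as long as it can export into that shared shape.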
Tasks
Task 7.1: Review existing provision for publication repositories, citation recording and long-term preservation in use across the facilities and in the user community, including facility libraries. (M8-M12)
  – D7.1: Report on survey of publication repositories, cross-linking and long-term preservation (M12).
Task 7.2: Propose strategy on integration of practices across the community (M12-M16).
  – D7.2: Proposal for integration of practices (M16).
Task 7.3: Develop final proposal on integration of practices across the community (M17-M18)
  – D7.3: Final report on standards for publication repositories, cross-linking and long-term preservation (M18)
Who should be involved?
• All partners involved
  – Representation from managers of records of publications (libraries)
• Set up a wiki group to start thinking of issues and approaches
• Evaluate user, data, software outputs for integration
• Collect information on suitable publication repositories
• Collect information on suitable initiatives and standards
  – Data integration and linking
  – Data preservation
  – Persistent identifiers
  – Data citation
• Begin to evaluate for best practice
Ready to participate with outlines at M9 workshops