Page 1: RDA Europe Magazine - 2nd issue

PLOTTING A COURSE FOR SUCCESS: Working Group outputs

SCIENTIFIC WORKSHOPS EXPLORE HOW TO SHARE DATA

SUPPORTING EARLY CAREER RESEARCHERS

DATA SHARING ACROSS BORDERS: RDA plenaries

@RDA_Europe

Contact: [email protected]

https://europe.rd-alliance.org

The Research Data Alliance is building the social and technical bridges that enable open sharing of data

‘Data, data, everywhere, Nor any drop to drink’

The title which Christine L. Borgman, paraphrasing Coleridge, used for her stirring keynote at the fourth RDA plenary neatly sums up the situation in which the international research community finds itself. As has been stated repeatedly, we are generating more data, at faster rates, than ever before. The danger is that we will let this precious resource slip through our fingers rather than distilling it into something truly valuable.

In response, the Research Data Alliance (RDA), funded by agencies in the United States, Australia and the European Union, is building the bridges by which data may be shared between research communities across the world. Its range of working and interest groups are focusing on the key challenges which must be overcome in order to create a thriving data-sharing landscape – find out what their initial conclusions were from p. 14 onwards.

Working with researchers is the foundation of the RDA approach, as shown by the project’s scientific workshops (see p. 2) and support for researchers in the early stages of their career (see p. 6), for example. In its quest to cross geographical, disciplinary and sector boundaries, RDA has worked to open up data sharing in developing countries (p. 8), partnered with similar projects (p. 9) and presented findings to policy makers (as an example, see p. 10). Above all, the RDA Plenaries offer a space where all interested parties can come together and work towards research data sharing, enriching research, contributing to scientific goals and ultimately empowering researchers to address the crucial societal and environmental issues affecting us today.

CONTENTS

Shaping the new landscape in research-data sharing: RDA scientific workshops

The Data Harvest: How sharing research data can yield knowledge, jobs and growth

RDA Europe supports early career researchers working with data

Opening up data for development: RDA at Open Data SSDC

RDA Europe teams up with ESFRI to work towards world-class research infrastructures

RDA addresses global data and computing e-infrastructure challenges

Breaking barriers to data sharing at the fourth RDA plenary

RDA plenary meetings

Plotting a course for success: An overview of early RDA results

RDA Europe partners: The European plug-in to RDA

CREDITS

EDITORS: HILARY HANAHOE (TRUST-IT), MADELEINE GRAY (BSC)

DESIGN: LAURA BERMUDEZ (BSC)

PHOTOGRAPHY: INGE ANGEVAARE, JOHNNY BAMBURY, KAI WEINSZIEHR

Shaping the new landscape in research-data sharing: RDA scientific workshops

How can the challenges involved in preparing for the future of data sharing be identified, mapped and addressed?

REPORT CONTRIBUTORS: BERNARD SCHULTZ, LEIF LAAKSONEN, RAPHAEL RITZ, HERMAN STEHOUWER, PETER WITTENBURG AND ROB BAXTER

The amount and complexity of scientific data currently being produced are so great that it is impossible to process them by traditional means. Fragmentation within and across disciplinary, institutional, political and geographical boundaries is increasing rather than decreasing. In many scientific domains, the time spent managing and manipulating data to make them reusable is unsustainable without support from new, highly automated processes.

With the aim of finding solutions to some of these issues, the Research Data Alliance (RDA) is working with the researchers who are increasingly dependent on data sharing and interoperability to carry out and extend their work. In February 2014, RDA brought together leading European scientists from a range of disciplines with specialists in data infrastructure for a workshop at the Max Planck Society headquarters in Munich. The workshop’s specific objectives were to:

• find out whether participants saw a role for RDA in their daily activities

• discuss what the scientific community might expect of RDA

• determine whether RDA needs to adapt to meet their expectations, and, if so, how it should do so

Over two days, workshop participants identified a number of key issues, from data sharing to reuse, to publishing and citing data, to infrastructures and repositories.

EXPERTS FROM A RANGE OF DIFFERENT DISCIPLINES PARTICIPATED IN THE WORKSHOP

The workshop was held at the Max Planck Society headquarters in Munich

Photo credit: Kai Weinsziehr, Max-Planck-Gesellschaft

General observations

There was general agreement on the need for systematic solutions which would guarantee reproducible science in an era where data usage is largely at a distance; such solutions should evolve continuously to adapt to multidisciplinary research and different types of data (see table below) and overcome sociological hurdles. Participants emphasized that only automated workflows will have the power to cope with increasing data demands. In addition, they underlined the need for proper processes for data management, access and reuse, covering metadata, documentation, structure and semantics. In addition to the need for seamless infrastructures, persistent and trusted repositories, workshop participants highlighted the importance of training future generations of data scientists.

Well-structured data | Heterogeneous data sets
Data with automatically generated metadata | Data with complex metadata issues
Static data | Dynamically changing data
Data acquired under controlled conditions | Crowd-sourced data
Centrally managed databases | Widely distributed data, no clear curation
Data that are computationally simple to handle | Data needing massive computing
Data that are used “raw” | Data that are understandable only after processing
Numerical data | Text data
Communities knowledgeable about data processing | Communities scared of data
Communities with trust | Communities with no tradition of sharing, even with distrust
Open data | Proprietary/embargoed data, data with copyright issues
Impersonal data | Data with privacy issues
Privately generated data | Data with publicly funded stakeholders

WORKSHOP PARTICIPANTS IDENTIFIED SEVERAL DATA SPECTRA. CLEARLY, DIFFERENT TYPES OF DATA WILL REQUIRE DIFFERENT STRATEGIES

Sharing and reusing data

Reusing data can only be successful if we can trust its identity, integrity, authenticity and the seriousness of all actors that are involved in the production chain. However, the mechanisms to establish and prove trust in a seamless way are not in place. Participants suggested that data sharing had until now been hindered by a number of barriers and sociological factors, such as the difficulty researchers experience in understanding one another’s data, a lack of high-quality metadata descriptions and a reluctance to invest time in proper documentation.

They found that there was a lack of efficient, cross-disciplinary, agreed methods to describe and process data semantically in a way which allows reuse. The current tendency towards piecemeal solutions which do not scale easily cannot continue. Participants mentioned the necessity to map data to agreed reference data to facilitate comparative analysis, but noted that establishing and maintaining such reference data requires significant financial support.

Publishing and citing data

The workshop attendees identified a need for a globally accepted mechanism for data publication and citation which would reflect the complexity in the area of data. They insisted that reference data must be stable, noting that in several fields persistent identifier (PID) systems have not been as stable as they need to be.

They also highlighted that there are different aspects involved in referring to accessible data. These include the necessity to refer to data objects and collections to execute workflows, as well as the need for mechanisms which would allow users to cite data that have undergone quality assessment, such as being published by a journal in a scientific paper. A further issue is that it has not yet been clarified whether data publication can be as highly rated for career building as peer-reviewed scientific papers. Some researchers argue that there is also a difference with respect to career intentions in each case: scientists versus data scientists.

While participants mentioned the need for an infrastructure to store identifiers permanently, along with attributes and the data themselves, they highlighted that it is not currently obvious who will finance such infrastructure, which may be costly. This responsibility – on the national, regional and organizational levels – needs to be established.

Infrastructures and repositories

Participants emphasized the need for infrastructures which would allow data to be processed efficiently. They underlined the importance of a leading role for researchers in the creation of infrastructures to ensure that they meet research needs.

Trustworthy, persistent repositories are the cornerstone of such infrastructures, they highlighted; these repositories require continuous funding and clear responsibilities, and must be properly evaluated. The data and metadata flow between them should be transparent, based on agreed interfacing and procedural standards. Infrastructures also need to encompass existing repositories, which implies that many legacy systems will need to be integrated. Registries and catalogues will be necessary to easily find trusted repositories, useful services and interesting collections.

While participants decided the principle of open access should be supported, they cautioned that some data need to be protected, under certain circumstances. Conversely, they also discussed the issue of commercialization of data, with businesses realizing the exploitable value of data and therefore investing significant amounts of money and effort to gain access to data. However, the viability of this model for scientific research is not evident, since there is a clear lack of trust at various levels.

Recommendations for RDA

Having identified some of the issues involved in sharing research data, how did workshop participants feel RDA could best respond to these challenges? Two days of stimulating debate can be summarized by the following recommendations:

RDA’s role

RDA should develop specific recommendations, application programming interfaces (APIs) and guidelines, etc. The organization should act as a forum, bringing together people who can work towards a common vision of functioning, stable infrastructures. If successful, this would supersede the current collection of limited, piecemeal solutions and thereby make the construction of infrastructure more cost effective.

Organization

RDA should aim to become more bottom-up, and the organization should look to data scientists and data librarians, in addition to leading researchers, to get involved in RDA activities.

Competition

In the push towards specifications and solutions, RDA may find itself competing with big commercial players, who may achieve de facto standards simply from having got there first.

Expectations

• RDA should invest in training younger generations of data scientists.

• RDA should support pilot projects, act as a clearing house and be able to give advice on data management, access and reuse to anyone involved in research.

• RDA should offer a service whereby data consultants visit organizations and help them to implement solutions.

• RDA results should be evaluated thoroughly and communicated honestly.

Future plans

A follow-up workshop is to be held in February 2015 at CERN, the European Organization for Nuclear Research, in Geneva, Switzerland. Following this, a further workshop is planned for 2016 at CNRS, the French National Center for Scientific Research.

If you are a data specialist or researcher interested in exploring vital issues around data sharing and reuse, we’d love to hear from you. Contact us for further details on how to participate by emailing [email protected].

The full workshop report is available on the RDA website: http://bit.ly/rda-science-workshop

THE DATA HARVEST: HOW SHARING RESEARCH DATA CAN YIELD KNOWLEDGE, JOBS AND GROWTH

In October 2010, the High Level Group on Scientific Data presented the Riding the Wave report to the European Commission outlining a series of policy recommendations on how Europe could gain from the rising tide of scientific data. Over four years later, a team of European experts have generated a new report. The Data Harvest: How sharing research data can yield knowledge, jobs and growth provides an update on the landscape described in the previous report, aiming to sound a warning on how Europe must act now to secure its standing in future data markets. In this report, we outline the benefits and challenges and offer recommendations to European policy makers. The seeds have been sown. Now is the time to plan the harvest.

RDA Europe supports early career researchers working with data

RDA Europe runs a programme offering funding for European early career researchers who work with data to attend RDA plenary meetings, with the aim of introducing them to RDA. As well as learning what data scientists and practitioners are doing at plenary meetings, these researchers support the RDA Working and Interest Group activities.

So far there have been two application rounds, the first for the third plenary (March 2014 in Dublin, Ireland) and the second for the fourth plenary (September 2014 in Amsterdam, the Netherlands). In total, nearly 60 applications were received and 35 researchers were approved, with applicants coming from a range of disciplinary backgrounds. One of the early career scientists at the third plenary in Dublin was Reko Hynönen from the Finnish Meteorological Institute.

‘I’m studying ultra-low frequency wave fluctuations of solar wind and their relationship with geomagnetic pulsations, space storms and other related phenomena. Their significance to the magnetosphere-ionosphere coupling is currently unclear,’ he explains.

He took part in the Data Foundation and Terminology Working Group and Metadata Interest Group at RDA Plenary 3.

‘As I did not know what to expect of the plenary, it was interesting to see how data as a topic can bring scientists and scholars from so many different disciplines together. I attended working groups on data foundation and terminology and metadata, topics that, to me, felt a bit irrelevant to research. However, after attending the sessions, I think they are significant for the continued use and sharing of data. Metadata in particular is important for the correct interpretation of data and for giving the credit to the people to whom it is due.’

‘Data as a topic can bring scholars from many different disciplines together’

The Irish Research Council (IRC) also supported researchers at the early stages of their career at the third plenary by hosting a poster session at the event. One of the researchers at this poster session was Aoife M. O’Brien, an IRC PhD scholar investigating the impact of bereavement and grief on young people.

‘My research is about investigating the impact of grief on primary and post-primary students within an Irish context. During my first two years I have been sending out qualitative questionnaires to 1500 schools, I have interviewed key organizations in Ireland and now I am starting my retrospective interviews with students who have experienced loss.

‘The RDA Plenary is a place for me to network and meet new people, which could possibly lead to further development in my research. There has been very positive interest in my poster and my research, and I have never been to a conference where people ask me such good and unusual questions. The subject of grief often makes people open up and share personal experiences.’

ANNI JAKOBSSON, CSC

‘I’ve never been to a conference where people ask me such good and unusual questions’

RDA Europe will support European early career scientists again for RDA Plenary 5, to be held in San Diego, California, USA, in spring 2015. Successful applicants are invited to display a poster summarizing their studies and areas of interest during the plenary meeting. They are also assigned an RDA Working or Interest Group which they are requested to attend, and are asked to write a summary of the meeting for publication following the event.

Interested in applying? Visit https://www.rd-alliance.org/plenary-meetings.html for more information.

EARLY CAREER RESEARCHER AOIFE O’BRIEN DISPLAYING HER POSTER

RDA EARLY CAREER RESEARCHERS AT P4 IN AMSTERDAM | PHOTO CREDIT: INGE ANGEVAARE

Opening up data for development: RDA at Open Data SSDC

In August 2014, the Committee on Data for Science and Technology (CODATA) organized the International Workshop on Open Data for Science and Sustainability in Developing Countries in Nairobi. One of a series, the workshop was supported by key organizations such as the United Nations Educational, Scientific and Cultural Organization (UNESCO), Jomo Kenyatta University of Agriculture and Technology and the International Council for Science’s World Data System (WDS). Most of the participants, of whom there were around 50, were from African countries and had an earth sciences background.

The main objective was to come to a resolution on open data which could be communicated to all developing countries. A paper was accepted by the audience, which will be made available in due course on the RDA website (https://rd-alliance.org). The main conclusions in this paper are similar to messages emanating from the Organisation for Economic Co-operation and Development (OECD), G8+5 and others; it is excellent to see worldwide convergence on this issue.

The workshop also featured projects working on improving work with data worldwide, with Simon Hodson (CODATA), Bernard Minster (WDS) and Peter Wittenburg (RDA) being asked to present their respective initiatives. CODATA is a policy-making initiative with extensive worldwide membership, while the focus of WDS is to establish a global network of trusted centres and RDA is a bottom-up organization devoted to removing concrete barriers which hamper data sharing and reuse. While there is overlap there is also complementarity, as well as close collaboration. Nevertheless, the workshop highlighted that these different roles can be confusing for people who are not closely involved. For developing countries it is already a hurdle to convince governments to become a member of CODATA, for example; persuading them to join more than one initiative would be out of the question. It was very important, therefore, to emphasize that participation in RDA does not require decisions at government level.

It was clear that the workshop, which offered a range of contributions, including about specific data-driven projects, was well received by the African participants and that more meetings would help build capacity among African data practitioners. It is to be hoped that RDA, CODATA and WDS, working in collaboration, will be able to find funds for follow-up workshops in Africa.

PETER WITTENBURG, MAX PLANCK INSTITUTE FOR PSYCHOLINGUISTICS

RDA Europe teams up with ESFRI to work towards world-class research infrastructures

Sharing research data is pivotal to ensuring Europe’s position as a leading force in the global knowledge economy, yet this can only be achieved if appropriate infrastructure is provided. To coordinate this strategically at the European level, the European Strategy Forum on Research Infrastructures (ESFRI) aims to develop scientific integration in Europe and to strengthen Europe’s international outreach. Its mission is to support a coherent, strategy-led approach to policy making on research infrastructures in Europe, while facilitating multilateral initiatives leading to the better use and development of research infrastructures. To this end, ESFRI has published three roadmaps, setting out a range of areas in which projects have since been set in motion.

The Research Data Alliance (RDA) was first presented to ESFRI projects at a workshop organized by the ESFRI service tool Communication and Policy Development for Research Infrastructures in Europe (CoPoRI) in December 2013; since then, discussions with ESFRI project members indicate a growing consensus on adopting the RDA method of working with different communities to overcome barriers and share best-practice examples.

Inspired by the way in which the internet community organized to remove hurdles on the path to globally connected nodes, the RDA model promotes a leading role by research communities, with cross-disciplinary working groups tackling different issues. Solutions will enhance the use of research data in all domains of science. Ensuring reliable and cost-effective connectivity and providing powerful computing services to the research community will also promote the exchange, reuse and processing of data.

The third RDA plenary meeting in March 2014 and the second International Conference on Research Infrastructures in April 2014 confirmed a mutual interest between RDA and the ESFRI project initiatives to explore these ideas and to consider how best to interact. With this in mind, RDA Europe decided to organize a workshop on data sharing and interoperability in October 2014 in Brussels, supported by the European Commission. Among the topics for discussion were the following:

• Understanding the current obstacles and suggested solutions with respect to data sharing and interoperability, specifically in the ESFRI project initiatives.

• Reporting on current activities in RDA, in particular the first results from September 2014 and their potential impacts.

• Discussing how interaction on potential cross-disciplinary solutions could be implemented in the near future.

• Proposing e-Infrastructure training days for early career researchers and scientists active in ESFRI project initiatives.

Further information about the workshop and its findings is available on the RDA Europe website (https://europe.rd-alliance.org). We look forward to further developments in the RDA collaboration with ESFRI in future.

ESFRI website: http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri

ESFRI supports a coherent, strategy-led approach to policy making on research infrastructures

RDA addresses global data and computing e-infrastructure challenges

11-12 DECEMBER 2014, GRAND HOTEL PLAZA, ROME, ITALY

Organized within the framework of the Italian Presidency of the European Union, the Research Data Alliance (RDA) and global Data and Computing e-Infrastructure challenges event in Rome shows that data sharing occupies a high place on the European agenda.

Data sharing enables us to make new discoveries and rich connections. For example, combining health, environmental, population and other data to address asthma risk in large urban areas requires infrastructure that supports the access, use, reuse, management, coordination, and stewardship of relevant data sets. Simply making the data available is insufficient for the coherent sharing and interpretation of that data. To make things more challenging, different research communities have disparate data standards, policies and practices. Consequently, sufficient enabling data infrastructure, both technical and social, is required to integrate data sets from distinct communities and enable collaboration across those communities, just as new technical infrastructure and common agreements were required to connect the computer networks that form today’s Internet.

The overarching objective of this strategic event was to present a reflection document to the European Commission, European Council and European Parliament as well as national and international funders and policy makers to ensure that the data and computing challenges that surround this issue are tackled at the highest possible level.

Topics addressed include:

• Promoting innovation in science: impact on growth and jobs.

• Opening science to society: education, training and communication.

• Enabling the integration of institutional, regional and national research capacity, including the development of efficient and sustainable mechanisms for implementing research infrastructures and e-infrastructures.

Organizers: The Italian National Research Council, together with the European Commission - DG CONNECT, Research Data Alliance (RDA) Europe, Italian Ministry of Education, Universities and Research (MIUR), Italian Supercomputing Center (CINECA), and Italian National Institute for Geophysics and Volcanology (INGV) organized the RDA and global Data and Computing e-Infrastructure challenges event within the framework of the Italian Presidency of the European Union.

Visit the RDA Europe website for further updates on this event: https://europe.rd-alliance.org

Breaking barriers to data sharing at the fourth RDA plenary

Taking place twice a year, Research Data Alliance (RDA) plenaries provide an unparalleled opportunity to work with likeminded researchers and data practitioners towards effective global data sharing. They allow for reflection upon RDA’s progress so far and help pinpoint areas where further work is needed.

With more than 555 participants from 40 countries, the fourth RDA plenary, held from 22 to 24 September 2014 at De Meervaart Conference Centre, Amsterdam, was the biggest yet. The tone for the conference was set by Neelie Kroes, then Vice-President of the European Commission with responsibility for the Digital Agenda. Speaking via video link, she stressed the need for open minds to facilitate open science and commended the RDA, which she said had ‘grown beyond any expectations’ and was continuing to expand in the right direction based on interoperability, infrastructure and openness.

An engaging keynote by Professor Christine L. Borgman, Presidential Chair in Information Studies at the University of California, Los Angeles, raised some timely and important points about rewards, responsibility and incentives in data practice, as well as the mechanisms required for the reuse of data.

Meanwhile in his keynote, Barend Mons, Associate Professor in Biosemantics at the Erasmus Medical Center, University of Rotterdam and at the Leiden University Medical Center and Head of Node at ELIXIR-NL, suggested that barriers to data sharing were predominantly social rather than technical. He presented one solution in the form of the FAIR model, which stipulates that research data should be findable, accessible, interoperable and reusable.

The initial outputs from four RDA Working Groups – Data Foundation and Terminology, Data Type Registries, PID Information Types and Practical Policy – were presented (see ‘Plotting a course for success - An overview of early results’ on p. 14 for a more detailed summary of working group activities). Taking RDA’s work forward, a wide variety of breakout sessions allowed participants to work on key topics, including data foundation and terminology, biodiversity data and the sustainability of e-research, with RDA Working and Interest Groups and birds-of-a-feather groups.

Full details of the plenary can be found on the RDA website: https://rd-alliance.org/plenary-meetings/rda-fourth-plenary-meeting.html

NEELIE KROES OPENING THE PLENARY WITH A FIRM MESSAGE OF SUPPORT

CHRISTINE BORGMAN’S KEYNOTE “DATA, DATA, EVERYWHERE, NOR ANY DROP TO DRINK”

Photos © Inge Angevaare

P1 18-20 March 2013 - Gothenburg, Sweden

• As big data emerges as an international priority, RDA is created with the aim of accelerating data-driven innovation through the sharing of data.

• Sponsors from the European Commission, United States Government and Australian Government launch the organization at its first plenary.

P2 16-18 September 2013 - Washington DC, United States

• The second RDA plenary brings scientists and policy makers together with research-data practitioners and infrastructure providers from all over the world.

• Delegates are updated on governance and specialist activities, research and policy communities give feedback.

• Mutually beneficial relationships between initiatives and organizations sharing the RDA vision begin to flourish.

P3 26-28 March 2014 - Dublin, Ireland

• Organized by Australia in close partnership with Ireland, the third plenary focuses on exploiting RDA’s work to date to its full potential.

• Delegates consider topics from agriculture to particle physics, from the humanities to bioinformatics.

• All parts of the data lifecycle are addressed, from foundational data to data publication and reuse.

P4 22-24 September 2014 - Amsterdam, The Netherlands

• The first outputs from the RDA Working Groups are showcased, along with information about future plans, as part of a week of data-related events.

• The Early Career Researchers programme provides financial support to researchers in Europe wishing to attend the event.

P5 9-11 March 2015 - San Diego, California, United States

• The fifth RDA plenary takes place in the United States, with a dedicated Adoption Day on 8 March 2015. Further information: https://www.rd-alliance.org/plenary-meetings/rda-fifth-plenary-meeting.html

“The World Wide Web was a great gift of science to society: now we can ensure that it helps the scientists back”

Neelie Kroes

Plotting a course for success: An overview of early RDA results

The first outputs from the Research Data Alliance (RDA) Working Groups – Data Type Registries, Data Termino-logy and Foundation, PID Information Types and Practi-cal Policy – have now been published. The groups were formed to allow the experts involved to consider topics which they found of high interest and relevance. Interac-tion between the groups was high due to the overlap be-tween the topics and each group understood that their work fits into a larger picture.

At the fourth RDA plenary in Amsterdam, these groups formulated the topic of ‘Data Fabric’ as the basis of an Interest Group. This group will work on a complete fra-mework for data processing which would lead to more efficient data management and processing, as well as to reproducible data science. This will be one of the core areas for RDA groups in the future.

Data Type Registries (DTR) Working Group

Co-Chairs:
• Larry Lannom, Corporation for National Research Initiatives, Virginia, United States
• Daan Broeder, Max Planck Institute for Psycholinguistics, Netherlands

Problem: Often researchers receive a file from colleagues, follow a link, or otherwise encounter data created elsewhere that they would like to make use of in their own work. However, they may not know how to work with it, interpret it or visualize its content, being unfamiliar with the specifics of the structure and/or meaning of the data, which can range from individual observations to complex data sets. Researchers often have to stop here since it requires too much work to search for explanations and tools, and, where tools exist, install them.

Goal: To allow data producers to record the implicit details of their data in the form of data types and to associate those types, each uniquely identified, with different instances of datasets. Data consumers can then resolve the type identifiers to type information to learn about the implicit assumptions in the data, identify services that can be used for this kind of data and find other information that can be used to understand and process the data without needing additional support from data producers. DTRs are intended to provide machine-readable information as well as human-readable information.

Solution: DTRs allow developers or researchers to add their type definitions in an open registry and, where useful, add references to tools that can operate on them. For example, a user who received an unknown file could query a DTR and be pointed to a visualization service able to display the data in a useful form. A fully automated system could use a DTR much as the Multipurpose Internet Mail Extensions (MIME) type system enables the automatic start of a video player in the browser once a video file has been identified. We envision humans taking advantage of data types in DTRs through type definitions that clarify the nuanced and contextual aspects of structured datasets.

GIRIDHAR MANEPALLI PRESENTING THE DTR WORKING GROUP RESULTS AT RDA P4 | PHOTO CREDITS: INGE ANGEVAARE

[Figure: human or machine consumers, data set dissemination, a federated set of type registries and a domain of services (visualization, processing, interpretation), linked by numbered steps 1 to 4.]

A user or machine receives an unknown type (1) (for example, a file or a term). The DTR is contacted and returns information about an available service (2) that allows the user or machine to continue processing the content (3, 4) without the user first having to acquire knowledge.

Data types in DTRs can be used to extend or expand existing types, such as MIME types, which provide only container-level parsing information. They can additionally describe experimental context, relationships between different portions of data and so on. Data types are deliberately intended to be quite open in terms of registration policies.

For example:

1. Researchers dealing with data (e.g. in a cross-disciplinary, cross-border context) find an unknown data type and can immediately process and/or visualize its content by using the DTR service.

2. A machine wants to extract the checksum information of a data object from a persistent identifier (PID) record to check whether the content is still the same. Without knowing the details of the PID service provider, the machine could ask for CKSM, for example, since this is an information type which all PID service providers have agreed upon and which is registered in the DTR.
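The lookup described in the first example can be sketched in a few lines of Python. This is a minimal illustration only: the class names, the identifier format and the service URL are hypothetical, and an in-memory dictionary stands in for what would in practice be a networked, federated registry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TypeRecord:
    """One entry in a hypothetical data type registry."""
    identifier: str       # unique type identifier (format is an assumption)
    description: str      # human-readable explanation of the type
    services: List[str] = field(default_factory=list)  # tools able to handle it


class TypeRegistry:
    """In-memory stand-in for a DTR; a real registry would be a remote service."""

    def __init__(self) -> None:
        self._records: Dict[str, TypeRecord] = {}

    def register(self, record: TypeRecord) -> None:
        # Type creators enter their definitions and, where useful, tool references.
        self._records[record.identifier] = record

    def resolve(self, identifier: str) -> Optional[TypeRecord]:
        # Consumers resolve a type identifier to its type information.
        return self._records.get(identifier)


def find_service(registry: TypeRegistry, identifier: str) -> Optional[str]:
    """Return the first registered service for a type, or None if unknown."""
    record = registry.resolve(identifier)
    if record is None or not record.services:
        return None
    return record.services[0]
```

A consumer holding only a type identifier can then call `find_service` and hand the data to whatever service comes back, without first studying the format.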

Impact: The potential impact on scientific practices is substantial. Unknown data types can be exploited without any prior knowledge, enabling enormous gains in time and/or interoperability. In a similar way to the MIME types that allow browsers to automatically select visualization software plug-ins when confronted with a certain file type extension, scientific software can make use of the definitions and pointers stored in the DTR to continue processing without the user acquiring knowledge beforehand. DTRs pave the way to automatic processing in our data domain, which is becoming increasingly complex, without putting additional load on the researchers.

Of course, type creators will need to enter the required information into a DTR. We assume that a federation of DTRs will be set up to satisfy different needs.

Schedule: Software to implement this DTR concept is being developed and will be available for download – visit the DTR WG’s web page at https://www.rd-alliance.org/group/data-type-registries-wg.html for updates. The RDA PID Information Type (PIT) Working Group is already using the first DTR prototype version in its API. The latest version of a DTR prototype is available here: http://typeregistry.org.

This simple model will be the starting point for designing DTRs, with the aim of adding to the specifications according to priorities and usage.

Data Foundation and Terminology (DFT) Working Group

Co-Chairs:
• Gary Berg-Cross, Research Data Alliance Advisory Council, Washington D.C., United States
• Raphael Ritz, Max Planck Institute for Plasma Physics, Germany
• Peter Wittenburg, Max Planck Institute for Psycholinguistics, Netherlands

Problem: Unlike the domain of computer networks, where the Transmission Control Protocol (TCP)/Internet Protocol (IP) and International Organization for Standardization (ISO)/Open Systems Interconnection (OSI) models serve as a common reference point, there is no common model for data organization, causing the widespread fragmentation which is evident in the data domain. The lack of a common language between data communities means that working with data is very inefficient and costly, especially when integrating cross-disciplinary data.

For the physical layer of data organizations, there is a clear trend towards simpler interfaces (from file systems to SWIFT-like interfaces). For the virtual layer of information, which includes persistent identifiers, metadata of different types including provenance information, rights information, relations between digital objects, etc., there are endless solutions that create enormous hurdles when federating. Almost every new data project designs yet more new data organizations and management solutions.

We are witnessing increasing awareness of the fact that at a certain level of abstraction, the organization and management of data is independent of its content. We therefore need to change the way we are creating and dealing with data to increase efficiency and cost-effectiveness.

Goals:
• To steer the discussion in the data community towards an agreed basic core model and some basic principles that will harmonize data-organization solutions.
• To foster an RDA community culture by agreeing on basic terminology arising from agreed reference models.

Solution: Based on 21 data models presented by experts from different disciplines and about 120 interviews and interactions with different scientists and scientific departments, the group has formulated a number of simple definitions for digital data in a registered domain, based on an agreed conceptualization. They have shown how these terms relate to one another, forming a reference model of the core of data organizations.
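The relation labels that appear in the group's model diagram (contains, is_described_by, is_identified_by, is_aggregation_of) suggest a structure along the following lines. This is a rough sketch only: the class and field names are our own, the published definitions are richer than this, and the mapping of each relation to a field is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DigitalObject:
    """A digital object in a registered domain (simplified, hypothetical)."""
    pid: str                 # is_identified_by: a persistent identifier
    bit_sequence: bytes      # contains: the stored bits
    metadata: Dict[str, str] = field(default_factory=dict)  # is_described_by


@dataclass
class DigitalCollection:
    """An aggregation of digital objects that is itself identified and described."""
    pid: str
    metadata: Dict[str, str] = field(default_factory=dict)
    members: List[DigitalObject] = field(default_factory=list)  # is_aggregation_of

    def member_pids(self) -> List[str]:
        # A collection can be passed around as a list of identifiers
        # rather than as copies of the underlying bit sequences.
        return [member.pid for member in self.members]
```

Because the identifier, the bits and the description are kept as separate parts, software can manage objects from any discipline in the same way, which is the independence from content that the group emphasizes.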

Impact:
• Members of the data community from different disciplines will be able to interact more easily with each other and reach a common understanding more rapidly.
• Developers will be able to design software systems for data management and processing, enabling much easier exchange and integration of data, especially in a cross-disciplinary setting.
• It will be easier to specify simple and standard application programming interfaces (APIs) to request information related to a specific digital object. Software developers will be motivated to integrate APIs from the beginning and thus facilitate data re-use.
• It will bring us a step closer to automating data processing in which we could rely on self-documenting data manipulation processes and thus reproducible data science.

[Figure caption: Diagram showing a simplified version of the basic data model that the DFT group worked out.]

Schedule: Definitions will be published both in a document and a semantic wiki. You are invited to comment on these – check https://rd-alliance.org/group/data-foundation-and-terminology-wg.html and http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page for updates.

Persistent Identifier Information Types (PIT) Working Group

Co-Chairs:
• Tobias Weigel, Deutsches Klimarechenzentrum (DKRZ), Germany
• Timothy Dilauro, Johns Hopkins University, Maryland, United States

Problem: Numerous systems and providers exist for the purpose of registering and resolving persistent identifiers (PIDs) for digital objects and other entities. However, different solutions usually stipulate different ways of associating additional information with the PID, such as that needed to prove identity and integrity. This means that a different application programming interface (API) needs to be developed and maintained for each provider. In order to request the checksum, it should be possible to program a single piece of software independent of the provider holding the PID. To facilitate this, providers need to agree on a common API, register their information types in a common data type registry (DTR) and agree on some core types, such as the checksum.

Goals:
• To agree on a core set of information types and register (and define) them in a commonly accessible DTR.
• To provide a common API and prototypical implementation to allow PID records to be accessed using registered types.

Solution: The PIT group accomplished the following:
• Defined and registered a number of core PID information types (such as checksum).
• Developed a model to structure these information types.
• Provided an API, including a prototypical server implementation which allows users to request certain types associated with PID records.

WO CHANG (LEFT) AND TOBIAS WEIGEL (RIGHT) PRESENTING PIT WORKING GROUP RESULTS

[Figure: a big data process consuming many digital objects from different repositories uses the API to request the checksum for all PIDs found (PID 1, PID 2, PID 3, ... PID k). PID resolution systems A, B and C label the value ‘checksum’, ‘check’ and ‘cksm’; each label is defined in the Data Type Registry, and the API makes use of the DTR definition.]

Information types listed in the figure:
LOC: location, path
CKSM: checksum
CKSM_T: checksum type
RoR: owning repository
MD: path to MD

Imagine that you have a list of PIDs referring to data you want to use in a computation, that these PIDs are registered at different providers and that you want to check whether all data objects are still the same. If all actors refer to the same entry in the DTR, interoperability is enabled, i.e. one module would be sufficient to retrieve the checksums from the appropriate resolver independent of the internal terminology used by the various providers.
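The scenario in this caption can be sketched as a single retrieval module. The sketch below is a hedged illustration, not the PIT API: the `PidProvider` class, the alias table and the use of SHA-256 are assumptions made for the example (in practice the CKSM_T type would state which checksum algorithm applies, and providers would be remote resolution services).

```python
import hashlib

# Provider-internal labels mapped to the shared registered type "CKSM".
# The three labels mirror the figure; the lookup table itself is an assumption
# standing in for definitions held in a data type registry.
REGISTERED_ALIASES = {"checksum": "CKSM", "check": "CKSM", "cksm": "CKSM"}


class PidProvider:
    """Hypothetical PID resolver holding records under its own field names."""

    def __init__(self, records):
        self._records = records  # PID -> {internal field name: value}

    def get_typed(self, pid, registered_type):
        # Return the value whose internal label maps to the registered type.
        record = self._records.get(pid, {})
        for field_name, value in record.items():
            if REGISTERED_ALIASES.get(field_name) == registered_type:
                return value
        return None


def verify(pid_to_provider, pid_to_bytes):
    """Check each object's bytes against the checksum stored in its PID record."""
    results = {}
    for pid, provider in pid_to_provider.items():
        expected = provider.get_typed(pid, "CKSM")
        actual = hashlib.sha256(pid_to_bytes[pid]).hexdigest()
        results[pid] = expected == actual
    return results
```

The point of the design is that `verify` never needs to know which provider it is talking to: it asks for the registered type and leaves the translation to the shared definitions.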

[Figure: DFT core data model. Entities: Digital Object, Persistent ID, Bit Sequence, Metadata Description, Digital Collection, Context Information, State Information. Relation labels: contains, is_described_by, is_identified_by, is_aggregation_of, is_a. Note: Persistent IDs are a specific type of metadata, and PID records in general describe specific properties of the digital objects (DOs).]

The set of core information types currently provided can help to illustrate cross-discipline usage scenarios. It can also act as an example for a community-driven governance process by which more user-driven types may be created and managed. PID service providers and community experts need to come together regularly and add types to the data type registry to make full use of the possibilities of the results of the PIT group.

It is now essential to convince PID service providers such as those using the Handle System (Digital Object Identifier (DOI), European Persistent Identifier Consortium (EPIC), etc.) to adopt the API in order to unify access. The accompanying diagram gives an example of the usage and potential of the suggested solution.

Impact: In a few years the amount and complexity of data will have increased in all sciences and there will be a greater need to rely on automatic processes, as human intervention results in a loss of efficiency. In such scenarios, communities can exploit the wealth of the data domain while relying on semantic interoperability between all the relevant actors, for example for big data analytics. The need to write application software would be reduced dramatically since only one API would be supported and one module would be sufficient to retrieve the checksum, for example, and to check identity and integrity.

Strengthening PID information types could also move the existing identifier systems and the overall idea of identification into a more central and fundamental position, as suggested by the Data Foundation and Terminology (DFT) group’s core model of a digital object, leading to an enormous increase in efficiency when dealing with data.

Schedule: Software is being built to implement a first prototype based on the PIT API. This first prototype works together with the DTR prototype and both are publicly available, although not designed for production use. Check the information on the PIT group’s webpage at https://www.rd-alliance.org/group/pid-information-types-wg.html for updated versions of the prototypes, which will be available for download. Meanwhile, the time has come to convince the PID service providers to adopt the solution.

Practical Policy (PP) Working Group

Co-Chairs:
• Reagan Moore, Renaissance Computing Institute (RENCI), North Carolina, United States
• Rainer Stotzka, Karlsruhe Institute of Technology, Germany

Problem: Establishing trust and ensuring reproducible data science requires highly automated, safe and documented processes, particularly considering the increasing amount and complexity of data.

However, current practice in managing and processing data collections is characterized by manual operations and ad-hoc scripts, making verification of the results almost impossible. All operations or chains of operations used on collections of data objects should have practical policies (PPs): these should be stated in plain language and should be transformed into robust, tested, executable code. PPs are essential for reproducible science; they represent an important element in the chain of building trust and are one of the core elements in repository-certification processes.
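To make the idea of a policy that exists both in plain language and as executable code more concrete, here is a minimal sketch of an integrity policy. Everything in it is illustrative: the `Policy` structure, the dictionary layout of a data object and the choice of SHA-256 are assumptions, and the sketch is not taken from the group's document or from any particular data management system.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Policy:
    name: str
    statement: str                    # the policy stated in plain language
    check: Callable[[Dict], bool]     # the same policy as executable code


def integrity_ok(obj: Dict) -> bool:
    # Recompute the checksum of the content and compare with the recorded one.
    return hashlib.sha256(obj["content"]).hexdigest() == obj["checksum"]


INTEGRITY = Policy(
    name="integrity",
    statement="Every stored object must match its recorded SHA-256 checksum.",
    check=integrity_ok,
)


def apply_policy(policy: Policy, collection: List[Dict]) -> List[Dict]:
    """Run a policy over a collection and return a log of the outcomes."""
    return [
        {"policy": policy.name, "object": obj["id"], "passed": policy.check(obj)}
        for obj in collection
    ]
```

The returned log is what replaces the ad-hoc script: each operation on the collection leaves a record that can be inspected later when results need to be verified.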

Goals:
• To define computer-actionable PPs that enforce proper management and stewardship, automating administrative tasks, validating assessment criteria and automating types of scientific data processing.
• To identify typical application scenarios for practical policies such as replication, preservation, metadata extraction, etc.
• To collect, register and compare existing practical policies.
• To enable sharing, revision, adaptation and re-use of such practical policies and thus harmonize practices, learn from good examples and increase trust.

As these goals were broad in scope, the PP WG focused its efforts on a few application scenarios for the collection and registration process.

RAINER STOTZKA PRESENTING PP WORKING GROUP RESULTS AT P4

[Figure: a data manager selects policies from a policy inventory (replication policies A, B, C, X and Y; metadata extraction policies k and l; etc.) for implementation and execution in a repository.]

A policy inventory will be made available with examples of best practice. Data managers will be able to select and implement the procedures most relevant to them.

Solution: As a first step, the PP WG conducted a survey to help identify the most relevant areas of practice. The analysis of the survey resulted in 11 key policy areas which were tackled first:

1. contextual metadata extraction
2. data access control
3. data backup
4. data format control
5. data retention
6. disposition
7. integrity (including replication)
8. notification
9. restricted searching
10. storage cost reports
11. use agreements

Participants and interested experts were asked to describe their policy suggestions in simple, semi-formal descriptions.

With this information, the WG developed a 50-page document including these descriptions, the beginning of a conceptual analysis and a list of typical cases, such as extracting metadata from DICOM, FITS, netCDF or HDF files.

The WG will continue until Plenary 5 (March 2015) and will focus on further analysing, categorizing and describing the offered policies. Currently, volunteers are reviewing the policies and different groups have started to implement some of these policies in environments such as the Integrated Rule-Oriented Data System (iRODS) and General Parallel File System (GPFS). The goal is to register prototypical policies with suitable metadata so that people can easily find what they are looking for and re-use what they have found at the abstract, declarative or even code level. At this point, there is still much work to be done to reach a stage where the policies can be easily used.

Impact: The potential impact is huge. In an ideal scenario, data managers or data scientists would be able to plug useful code into their workflow chains to carry out operations at a high level of quality.

This would improve the quality of all operations on data collections and thus increase trust and simplify quality assessments. Large data federation initiatives such as EUDAT (http://www.eudat.eu) and the DataNet Federation Consortium (United States; http://datafed.org) are very active in this group, since they also expect to share code development/maintenance, thus saving considerable effort by re-using tested software components.

Research infrastructure experts needing to maintain community repositories can simply re-use best-practice suggestions, thus avoiding pitfalls. In particular, where these suggestions for best practice in practical policy are combined with proper ways of organizing data, as suggested by the Data Foundation and Terminology Working Group, powerful mechanisms will be in place to simplify the data landscape and make federating data much more cost effective.

Schedule: The document mentioned above is a valuable resource, providing inspiration and suggested policies. Once the policies have been evaluated, properly categorized and described, the real advance will be to register them in suitable registries so that data professionals can easily reuse them, possibly even at code level. The group intends to reach this step in a number of policy areas by the end of March 2015, making use of the policy registry developed by EUDAT.

For more details on the PP WG, see https://www.rd-alliance.org/group/practical-policy-wg.html

RDA Europe Data Practice Analysis

Editors: Peter Wittenburg and Herman Stehouwer, Max Planck Institute for Psycholinguistics, Netherlands

What did we do?

For the RDA Europe Data Practice Analysis Programme, we held a large number of interviews with data scientists and practitioners from various communities. We interviewed these people about various aspects of their data environment, including data acquisition, data processing, the computational environment, services and tools, and the data-related policies being applied.

We interviewed 24 communities and attended more than 70 community meetings. We combined these observations with the interviews undertaken and observations made in the EUDAT project (www.eudat.eu), in the Radieschen project (www.forschungsdaten.org/index.php/Radieschen), and in the first RDA Europe Science Workshop (see ‘Shape the new landscape in research-data sharing: RDA scientific workshops’, pp. 2-5). Based on these sources of information we came to a large number of observations, which are summarized here in the form of the dominant underlying data process model and 12 key observations.

Data Process Model

The process model in the figure emerges as the dominant underlying process model that most data scientists/practitioners implicitly use when processing data. In practice the methods used in the departments deviate slightly from this generic model in various ways, but it summarizes what is being done at an abstract level very well. It should also be remembered that aspects of data processing are often implicitly undertaken manually, with ad-hoc solutions, rather than by following an explicit model.

The model helps us to clarify our observations and to identify specific steps as they relate to data, specifically:
• Data is scientifically meaningful and relevant after the pre-processing step.
• Data is ready for upload to a repository after the curation step.
• Data is ready for reuse after the registration step.
• Data is ready for citation after the publishing step.

Currently most researchers do not explicitly distinguish between these steps. Explicitly separating these steps of the data process would increase efficiency and decrease costs.
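One way to make the four steps explicit in software is to record, for each dataset, the last step it has completed. The sketch below assumes that the steps run in the order listed above and that the properties accumulate from one step to the next; both are our reading of the model, and the names are hypothetical.

```python
from enum import IntEnum
from typing import List


class Step(IntEnum):
    """The four steps named in the process model, in the assumed order."""
    PRE_PROCESSING = 1
    CURATION = 2
    REGISTRATION = 3
    PUBLISHING = 4


# What the model says about data once each step is complete.
READY_AFTER = {
    Step.PRE_PROCESSING: "scientifically meaningful and relevant",
    Step.CURATION: "ready for upload to a repository",
    Step.REGISTRATION: "ready for reuse",
    Step.PUBLISHING: "ready for citation",
}


def properties_after(last_completed: Step) -> List[str]:
    """List everything that holds for data that has completed the given step."""
    return [READY_AFTER[step] for step in Step if step <= last_completed]
```

For instance, `properties_after(Step.CURATION)` lists the first two properties only, which makes it obvious that curated but unregistered data is not yet ready for reuse.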

The model shows similarities to existing models of data processing – such as Kahn/Wilensky 2006, ResourceSync, CLARIN, EPOS, ENES, ENVRI, EUDAT core model, ORE, Europeana, OAIS, Datacite/EPIC, and DICE (as used by iRODS) – and it can be used to place the observations made in the analysis program as well as to describe a data management system. In the diagram we also show where the topics of the first RDA Working Groups could be located.

HERMAN STEHOUWER INTRODUCING THE WORKING GROUPS, RDA P4

Observations

1. ESFRI projects and the recent developments within e-Infrastructure have had a strong and positive influence on data management practices.

2. Open Access is supported everywhere as a basic recommendation. However, in practice there are many barriers that still need to be overcome.

3. Trustworthiness is a key issue and new methods are urgently required to establish trust in the entire data-processing chain.

4. Legacy data is a problem in many communities, and even now new data is often badly documented and organised; we are therefore continuously creating new legacy data which will require a great deal of effort to be integrated into the accessible data domain. There is 1) a lack of knowledge about principles of proper data organisation; 2) a lack of experts, time and money to change practices; 3) a lack of off-the-shelf software methods for improved data management and access.

5. Big data is driving many new scientific requirements that dictate the thorough adoption of this paradigm in increasing numbers of departments. However, big data only scales when the data-management and access methods used also scale.

6. Data management needs to move towards including the logical layer of information, i.e. metadata, persistent identifiers (PIDs), rights, relations to other data, etc. Ultimately, the current file system-based methods are too inefficient and costly. A large amount of researchers’ time is wasted in finding the right data objects, interpreting them and creating meaningful collections.

7. Metadata practice needs to be improved in order to help discovery and reuse (especially after some time). Guidance and ready-to-use packages and software are required to improve the situation.

8. Lack of explicitness is an issue in relation to data, which hinders efficient machine-based processing of data. This lack ranges from non-registered digital objects (i.e. ones which lack PIDs), data integrity information (such as checksums), collection descriptions, encoding systems, format/syntax, and semantics up to the level of software components. Appropriate registration authorities and mechanisms do exist, but often they are unknown or not used.

9. Centres for managing data across communities are a clear trend. Such centres and repositories need to be established to provide a long-term reliable service to all researchers. Creating virtual collections or carrying out distributed processing jobs is still an unresolved issue. Some aspects of distributed authentication and authorization are still not in place at the European level, while distributed computing, although mentioned increasingly often, is not a well-understood scenario.

10. There is a clear need for education and training to address the lack of data professionals, which hampers progress.

11. The lack of knowledge about and of trusted information on services offered (registries, data, storage, curation, analytics, etc.) is an issue. There are many possibilities, but many people cannot cope with the information flood and have a hard time making selections. A more structured and trusted approach to offering information would have great impact.

12. RDA needs to ensure it is a true grass-roots organization. It needs to provide demonstration cases and give help and support to research communities.

Contributors to this analysis: Rob Baxter, Daan Broeder, Marina Boulanov, Francoise Genova, Diana Hendrix, Fotis Karagiannis, Leif Laaksonen, Eleni Petra, Gavin Pringle, Herman Stehouwer, Constantino Thanos, Peter Wittenburg, Franco Zoppi.

CSC, Finnish IT Center for Science

CINECA

EPCC, University of Edinburgh

Science and Technology Facilities Council (STFC)

Association of Commonwealth Universities

CNR, National Research Council of Italy

Maastricht University

Trust-IT Services Ltd

Athena Research & Innovation Centre

Centre National de la Recherche Scientifique (CNRS)

Max Planck Society

Barcelona Supercomputing Center (BSC)

RDA partners: the European plug-in into RDA

RDA Europe (312424) is a Coordination and Support action funded by the European Commission under the 7th Framework Programme (FP7-INFRASTRUCTURES-2012-1) - Project started 1 September 2012