Managing and sharing research data:
an afternoon tour
Louise Corti
Director, Collections Development and Producer Relations
UK Data Service
University of Essex
Workshop on Research Data Management
Conference ‘Changing Landscape of Science & Technology
Libraries’
Indian Institute of Technology Gandhinagar
2-4 March 2017
Plan for this afternoon
Session 1: Overview of data management: presentation on
data management planning, and key elements for librarians
and data support services
• Quiz and planning exercises in pairs
Session 2: How the UK Data Service does it – speciality in
social and biomedical sciences: presentation
Session 3: Examples of University multi-purpose data
repositories and Indian Data Service:
• Presentations and showcase
Session 4: Skills required – questions you need to consider in
your role as data sharing advocates
• Exercise and discussion
Brief break for Gangnam style dance off!
Making data available is trending now
Open access and transparency agendas
Huge progress in opening up government data
(gov.data)
Lack of trust in published academic findings –
demands for evidence for claims and verification
Value for money from public funds
Why and how is data shared?
• Research data is often exchanged in informal
ways with collaborators and colleagues
• Formally publishing data brings many advantages
– longevity, robust citation and proper attribution
• Publishing data from research has grown rapidly in
recent years as a result of funder and journal
policies
Benefits of data sharing
To funders
• make optimal use of publicly funded research
• maximise return on investment
• avoid duplication of data collection
To the scholarly community
• maintain professional standards of open inquiry
• quality improvement from verification, replication and trust
• develop long time series of data
• promote innovation through unintended, new uses of data
• Study documentation for research design and teaching
Benefits of data sharing
To research participants
• allow maximum use of their contributions
• minimise data collection on the hard-to-reach (e.g. ill,
elites)
To the public
• production of high quality findings with social value
• advance science to the benefit of society
• compliance with laws and regulations
• adoption of emerging norms – ‘open access’ publishing
• Seen to be open and accountable
Good research should be dependent on
process-visibility
Data sharing: data made available for secondary analysis
Research transparency: data used to support an evidence-based claim
Transparency organizations
International research funder policies
• Most based on: OECD Principles and Guidelines for Access to
Research Data from Public Funding
• UK: variety of models
• Data management plans and recommendation only
• Dedicated data centres
• Institutions taking responsibility
• Europe
• Communication & recommendation on access to / preservation
of scientific information (publications, data)
• USA: NSF and NIH
• Data management plans
Journal and Publisher Data Policies
• Many science journals have policies for data sharing
• Science, Nature, BioMed Central, PLOS ONE
“PLOS ONE will not consider a study if the conclusions depend
solely on the analysis of proprietary data … the paper must include
an analysis of public data that validates the conclusions so others
can reproduce the analysis.”
• Data underpinning publication accessible:
• upon request from author
• as supplementary materials alongside publication
• in public repository or specialist data centre
• in mandated repository (e.g. PANGAEA – Elsevier)
• Usually need a Digital Object Identifier (DOI)
Social science journals: transparency agenda
• Replication policies exist for psychology,
economics and political science journals
• The Journal of Development Economics
• Quarterly Journal of Economics
• Quarterly Journal of Political Science
• Options:
• Authors supply enough information that the exact analysis
can be replicated - raw data, survey forms, data collection
protocols, computer programs and scripts, etc.
• Authors are required to replicate their study, ideally with a
preregistered design
• Journal may have its own repository
Why do we need to plan for data management in research?
• Think how to design and implement research
• Consider how to look after research data safely
• Keep track of research data (e.g. staff leaving)
• Identify support, resources, services needed
• Plan data storage, short & long-term
• Plan data security, ethical aspects
• Plan for current and future data uses (data sharing /
publishing)
Why DMP – funder requirements
• Many research funders require a plan for data
management as part of research applications
• Expect to cost sustainable data management
and sharing into research
• Overview of requirements:
• Digital Curation Centre, UK: UK Funders’ data plan
requirements
• California Digital Library, USA:
DMPTool – Funder requirements
ESRC research data policy
Research data should be openly available to the maximum extent possible
through long-term preservation and high quality data management.
(ESRC Research Data Policy, 2010)
• ESRC grant applicants planning to create data during their research include a data management plan with their application, as an attachment to the Je-S form
• ESRC award holders offer their research data to the ESRC Data Store (managed by UK Data Service) within three months of the end of their grant, to preserve them and to make them available for new research.
Researchers who collect the data initially should be aware that ESRC
expects that others will also use it, so consent should be obtained on this
basis and the original researcher must take into account the long-term use
and preservation of data. (ESRC Framework for Research Ethics, 2012)
ESRC data management plan
Assessment of existing data
Information on new data
Quality assurance of data
Backup and security of data
Expected difficulties in data sharing
Copyright / Intellectual Property Right
Responsibilities
Preparation of data for sharing and archiving
ESRC DMP guidance
Why DMP - practical
• Identify key data management decision points in
the research lifecycle
• How to address these?
• Who will address these?
Examples:
• Set up data storage area on server
• Verify institutional back-up policy is in place
• Design database with documented labels, codes
• Identify factors that limit, prohibit data sharing
and re-use before data collection starts
Data life cycle intervention
• Sign off consent form
• Agree data & metadata templates/organisation
• Data sharing protocols
• Licensing, terms and conditions for sharing, formal documentation
• Data formats, data migration
Roles & responsibilities
• Project director: design, oversee research
• Research staff: design research, collect, process and
analyze data, decide where to keep data & who has access
• Laboratory or technical staff: generate metadata and doc.
• Database designer
• External contractors: data collection, data entry, transcribe,
process, analysis; agree standard protocols
• Support staff: manage and administer research and
funding, ethical review and assess property rights
• Institutional IT services: storage, security, backup services
• External Data Centres: facilitate data sharing
Cost research data management
• Cost RDM into research applications and budgets
• List and identify resources needed to make research
data shareable beyond primary research team - above
planned standard research procedures and practices
• Resources
• People and skills
• Equipment
• Infrastructure, storage and access costs
• Tools to manage, document, organise
• Early planning can reduce costs
UKDA Data management costing tool
• STEP 1: check data management activities in table and tick what applies to your proposed research; we propose 18 essential RDM activities
• STEP 2: for each selected activity, estimate / calculate additional time and/or resources needed and cost this
• STEP 3: add data management costs to your research application; coordinate resourcing and costing with your institution, research office and institutional IT services
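The three costing steps can be sketched in a few lines of Python. This is a minimal illustration, not the UKDA costing tool itself: the activity names, day counts, other costs and the day rate are all made-up assumptions, and the function name is hypothetical.

```python
def cost_data_management(activities, day_rate):
    """Select the activities that apply and sum their additional cost."""
    # STEP 1: keep only the activities ticked as applying to this project.
    selected = [a for a in activities if a["applies"]]
    # STEP 2: cost the additional staff time plus any other resources.
    total = sum(a["extra_days"] * day_rate + a["other_costs"] for a in selected)
    # STEP 3: the total is what gets added to the research application.
    return selected, total


# Hypothetical activities and figures, for illustration only.
activities = [
    {"name": "Anonymisation", "applies": True, "extra_days": 3, "other_costs": 0},
    {"name": "Data documentation", "applies": True, "extra_days": 2, "other_costs": 0},
    {"name": "Secure storage", "applies": True, "extra_days": 0, "other_costs": 150},
    {"name": "Digitisation", "applies": False, "extra_days": 10, "other_costs": 500},
]

selected, total = cost_data_management(activities, day_rate=200)
print([a["name"] for a in selected], total)
```

Only the additional effort beyond standard research practice is counted, which mirrors the advice to cost what is needed to make data shareable beyond the primary research team.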
Some guidance on writing a DMP
• Funder template, guidance & example plans
– NSF-SBE DMP content guidance
– NIH example plans
– ESRC DMP requirements in data policy and DMP guidance
– MRC DMP guidance and template
– AHRC technical appendix requirements
• Generic tools:
– DMPTool
– DMPonline tool
Brief poll and questions
• Who has undertaken research?
• Who has collected /created their own data?
• Who has written a DMP?
• What Data Policies exist in India?
• National Data Sharing and Accessibility Policy (NDSAP) – non sensitive data only?
The nuts & bolts of managing and sharing
data
Formatting and organising data
Storing and transferring data, including encryption and security
Legal, ethical issues: consent, anonymisation & access control
Rights relating to research data
Documenting and contextualising data
Publishing and citing research data
Can you understand/use these data?
SrvMthdDraft.doc
SrvMthdFinal.doc
SrvMthdLastOne.doc
SrvMthdRealVersion.doc
Formatting and organising data
Consistent templates for similar kinds of data
Well organised – consistent folders
Folders/files are suitably named and properly
versioned
Identify the authenticity of master files
File formats
• Choice of software format for digital data:
• hardware used e.g. audio capture
• discipline-specific customs and planned data analyses
• software availability/cost
• Digital data is endangered by software/hardware obsolescence
• Best formats for long-term preservation are standard,
interchangeable and open formats:
• tab-delimited, comma-delimited (CSV), ASCII
• SPSS portable, XML
• RTF, OpenDocument format, PDF/A
• Recommended formats
• Beware of errors/losses of data when converting!
Format conversion
MS Excel (.XLSX) format using colour highlighting for annotation
Tab-delimited text format, and loss of colour annotation
Loss of
annotation
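One way to avoid the loss described above is to record annotations as an explicit column before converting, then check that the converted file still holds everything. The sketch below uses Python's standard csv module; the rows, the "note" column and the function names are hypothetical.

```python
import csv
import io

# Hypothetical rows: in the spreadsheet the second row was highlighted in colour.
# Here the annotation is held as text so it survives conversion.
rows = [
    {"id": "1", "income": "21000", "note": ""},
    {"id": "2", "income": "999999", "note": "outlier, check with depositor"},
]


def to_tab_delimited(records, fieldnames):
    """Write records as tab-delimited text and return it as a string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_tab_delimited(text):
    """Read tab-delimited text back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


text = to_tab_delimited(rows, ["id", "income", "note"])
# Round-trip check: the converted file must hold exactly what we started with.
assert from_tab_delimited(text) == rows
print(text)
```

The round-trip comparison is the important part: it catches silent losses at conversion time rather than months later.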
File naming
• File name - principal identifier of file
• Use logical naming i.e. easy to identify, locate,
retrieve, access
• Naming provides organisation, context &
consistency
• Name elements: version number, date, content
description, creator name
• For separation use underscores _
• Avoid very long file names!
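The name elements listed above can be combined mechanically. This is a minimal sketch assuming one possible convention (content description, creator, date, version, joined by underscores); the function name, element order and example values are illustrative assumptions rather than a prescribed standard.

```python
import re
from datetime import date


def build_file_name(description, creator, created, version, extension):
    """Join name elements with underscores (hypothetical convention)."""

    def clean(text):
        # Drop spaces and punctuation and capitalise each word, so that
        # underscores are used only to separate the name elements.
        words = [w for w in re.split(r"[^A-Za-z0-9]+", text) if w]
        return "".join(w[:1].upper() + w[1:] for w in words)

    parts = [
        clean(description),
        clean(creator),
        created.strftime("%Y%m%d"),  # year-month-day sorts chronologically
        f"v{version:02d}",           # zero-padded version sorts correctly
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


print(build_file_name("survey method", "LC", date(2017, 3, 2), 3, "doc"))
```

Compared with names such as SrvMthdRealVersion.doc, a generated name states which version is the latest without anyone having to guess.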
Version controlling
Keep track of different copies or versions of data files
Best practice:
• unique identifiers for files (naming convention keeping track)
• record file status/versions
• record relationships between files
e.g. data file and documentation; similar data files
• keep track of file locations
e.g. laptop vs. PC
Tools available for versioning and synchronising files
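Dedicated versioning tools exist, but the idea of keeping track of file status can be shown with a simple checksum manifest. The sketch below is an illustration under assumptions: the function names and the example file are hypothetical, and SHA-256 is just one reasonable choice of checksum.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): file_checksum(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files added, removed or altered between two manifests."""
    names = set(old_manifest) | set(new_manifest)
    return sorted(n for n in names if old_manifest.get(n) != new_manifest.get(n))


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "survey_v01.csv"
    data_file.write_text("id,age\n1,34\n")
    before = build_manifest(tmp)
    data_file.write_text("id,age\n1,35\n")  # an undocumented edit to the data
    after = build_manifest(tmp)
    print(changed_files(before, after))
```

Comparing manifests from two locations (for example laptop and PC) shows which copies have drifted apart, which helps identify the authentic master file.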
Data capture - it’s all about consistency
Example: Transcription of text from audio
• Use a uniform layout throughout – use a template
• Provide guidance on how you would like the data transcribed
• Indicate speakers
• Capture verbal and non-verbal?
• Implications of various technologies – video, multiple camera, screen
capture, webcams
Example: digitisation of photographs
• Specify the expected output
• Use standard settings on equipment, plus capture of correct metadata from
images
Regardless of who does the work – rules help to make data cleaner
and easier to share!
Organising data
• Plan in advance how best to organise data
• Use a logical structure and ensure collaborators understand
Examples
• hierarchical structure of files, grouped in folders, e.g. audio, transcripts and annotated transcripts
• survey data: spreadsheet, SPSS,
relational database
• interview transcripts: individual
well-named files
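A planned folder structure can be created up front so that all collaborators start from the same layout. This is a small sketch using the example folders mentioned above; the folder names, the function name and the project name are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Folder names taken from the example above; adjust to the project.
FOLDERS = ["audio", "transcripts", "annotated_transcripts"]


def create_project_structure(root, folders=FOLDERS):
    """Create one sub-folder per kind of data and return their paths."""
    root = Path(root)
    created = []
    for name in folders:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to run repeatedly
        created.append(folder)
    return created


with tempfile.TemporaryDirectory() as tmp:
    for folder in create_project_structure(Path(tmp) / "interview_study"):
        print(folder.name)
```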
Storing data safely
• Looking after data – protect from damage and loss
• Strategies in place for:
• backing-up
• transmission
• secure storage
• disposal
Backing up data
• Why do back-ups? Risk of loss and change - would your
data survive a disaster?
• Protect against: software failure, hardware failure,
malicious attack, natural disasters
• Back-ups are additional copies that can be used to
restore originals
• It’s not backed-up unless backed-up with a strategy
Digital back-up strategy
Consider
• what’s backed-up? - all, some, just the bits you change?
• where? - original copy, external local and remote copies
• what media? - CD, DVD, external hard drive, tape, etc.
• how often? – assess frequency and automate the process
• for how long is it kept? Data retention policies that might apply?
• verify and recover - never assume, regularly test a restore
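The "verify and recover" point can be made concrete: a copy only counts as a back-up once it has been checked against the original and a restore has been tested. The following is a minimal sketch with hypothetical function and file names; a real strategy would also cover scheduling, remote copies and retention.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a (small) file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(source, backup_dir):
    """Copy a file into backup_dir and verify the copy by checksum."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)  # copy2 also tries to preserve timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Back-up of {source} does not match the original")
    return target


def restore_is_intact(backup_copy, expected_checksum):
    """Restore to a scratch folder and confirm the content is unchanged."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = shutil.copy2(backup_copy, scratch)
        return sha256_of(restored) == expected_checksum


with tempfile.TemporaryDirectory() as work:
    original = Path(work) / "interviews.csv"
    original.write_text("id,text\n1,hello\n")
    copy = back_up(original, Path(work) / "backup")
    print(restore_is_intact(copy, sha256_of(original)))
```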
Backing-up need not be expensive
• 1 TB external drives are around
£50, with back-up software
Consider non-digital storage too!
Encryption and security of data
Encrypt personal or sensitive data:
• when moving or storing files
• free software – easy to use: SafeHouse, TrueCrypt, AxCrypt
• encrypt hard drives, partitions, USBs, files and folders
Protect from unauthorised access, change, disclosure, destruction
• control access to computers, buildings, rooms, cabinets
• restrict access to sensitive materials e.g. consent forms
Proper disposal of equipment and media
• even reformatting the hard drive is not sufficient
File sharing & collaborative environments
• Too often data sent via email attachments!
• Virtual Research Environments
• MS SharePoint
• Cloud solutions
• Google Drive, Dropbox, OneDrive
• Basecamp
• Locally managed: ownCloud, ZendTo
• File transfer protocol (ftp)
• Physical media
• Data Protection Act for location of data storage
Sharing confidential data
Researchers should:
• obtain informed consent from human subjects for
data sharing and preservation /curation
• protect identities by not collecting personal data as
‘research data’; or anonymise data
• restrict / regulate access where needed (all or part
of data). UK Data Service uses a spectrum of
access
Consider jointly and in dialogue with participants
Plan early in research
Disclosure review
• Direct identifiers: names; addresses; telephone numbers; email
addresses; images (and check file properties!)
• Indirect identifiers: demographics: age, ethnicity,
education/employment details, religion, household size, detailed
income, geography. Could combinations reveal identity?
• Protect confidentiality without compromising the usability
of data. If this can’t be achieved, consider more restrictive access
• Solution: discuss with data creator – data edits (recoding,
banding, aggregation, pseudonyms etc.) or access restriction?
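The data edits mentioned above (banding, pseudonyms) and the question "could combinations reveal identity?" can be sketched as follows. The records are invented, the function names are hypothetical, and the threshold of 2 is an arbitrary assumption; a real disclosure review needs judgement as well as counts.

```python
from collections import Counter


def band_age(age, width=10):
    """Recode an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def pseudonymise(names):
    """Give each distinct name a stable code such as 'P001'."""
    codes = {}
    for name in names:
        codes.setdefault(name, f"P{len(codes) + 1:03d}")
    return codes


def risky_combinations(records, keys, threshold=2):
    """Return combinations of indirect identifiers held by fewer than
    `threshold` records, i.e. combinations that may single someone out."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return sorted(combo for combo, n in counts.items() if n < threshold)


# Invented records for illustration only.
raw = [
    {"name": "Respondent A", "age": 34, "region": "North"},
    {"name": "Respondent B", "age": 37, "region": "North"},
    {"name": "Respondent C", "age": 62, "region": "South"},
]

codes = pseudonymise(r["name"] for r in raw)
edited = [
    {"id": codes[r["name"]], "age_band": band_age(r["age"]), "region": r["region"]}
    for r in raw
]
print(risky_combinations(edited, ["age_band", "region"]))
```

Any combination flagged here is a prompt to go back to the data creator: either band more coarsely or move the data to a more restrictive access level.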
Copyright and data sharing
• Copyright permissions sought and granted prior to data
sharing / archiving
• Clearing copyright for re-use – reach agreement with
copyright holder
• Copyright holders give permission to repositories to
preserve and publish data and provide user access
• Repositories do not inherit the copyright, but may
have some rights e.g. database right
• ‘Fair dealing’ exception in UK Copyright Law for non-
commercial research, private study, teaching,
quotations, criticism or review; then author and source
must be cited
Matching data with documentation
• Do the data match the documentation?
• Is any data missing?
• If anything looks wrong go back to the
depositor
• Never amend data without checking
with data depositor/owner first!
Do No Harm!
Useful documentation for users
• Questionnaire, field work procedures
• Interview schedule or topic guide
• Observation or diary templates
• Stimuli e.g. scenarios, photos, images
• Field notes
• Outputs e.g. reports
• Details of any processing e.g. digitisation or
new derived variables /measures
• Information sheet and consent agreement
• Errata
The value of the ‘ReadMe’ file
Good practice for each data collection
• For each filename a short description of data is included
• Capture relationships between the data files
• For tabular data definitions of column headings and row
labels, data codes (including missing data) and
measurement units
• For textual data a data list of all interviews, focus
groups, etc.
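A ReadMe covering the points above can be generated from a few structured notes, which keeps it consistent across data collections. This is a sketch only: the layout, the function name and the example study, files, columns and missing-data code are all assumptions.

```python
def build_readme(title, files, columns, missing_code):
    """Assemble a plain-text ReadMe from file and column descriptions."""
    lines = [title, "=" * len(title), "", "Files"]
    for name, description in files.items():
        lines.append(f"  {name}: {description}")
    lines += ["", "Column definitions"]
    for column, (meaning, unit) in columns.items():
        lines.append(f"  {column}: {meaning} (unit: {unit})")
    lines += ["", f"Missing data code: {missing_code}"]
    return "\n".join(lines) + "\n"


# Hypothetical study, files and variables.
readme = build_readme(
    title="Household survey (example)",
    files={
        "survey_v01.csv": "survey responses, one row per household",
        "codebook_v01.pdf": "documentation for survey_v01.csv",
    },
    columns={
        "hh_size": ("number of people in household", "persons"),
        "income": ("gross annual household income", "GBP"),
    },
    missing_code="-9",
)
print(readme)
```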
Descriptive metadata
Record for discovery – Open license; harvestable via OAI-PMH
Persistent identifier – e.g. Datacite DOI
Controlled vocabularies
Standardised schema for data description
Dublin Core and DataCite Core
Data Documentation Initiative (DDI) – good for research data!
Text Encoding Initiative (TEI) – good for text markup
Use Extensible Markup Language (XML)
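As an illustration of a standardised schema expressed in XML, the sketch below builds a very small record using Dublin Core elements with Python's standard library. The wrapping "record" element, the function name and all field values (including the placeholder identifier) are assumptions; real catalogue records would follow the repository's own profile.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a simple record from (element, value) pairs."""
    root = ET.Element("record")  # hypothetical wrapper element
    for element, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values for illustration; the identifier is a placeholder.
xml_text = dublin_core_record([
    ("title", "Household survey (example)"),
    ("creator", "Example Research Team"),
    ("date", "2017"),
    ("identifier", "doi:10.0000/example"),
    ("rights", "Open licence"),
])
print(xml_text)
```

Because the element names come from a shared schema, a record like this can be harvested and understood by other catalogues without knowledge of the local system.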
Your turn
• Documentation quiz – discuss briefly with the person
next to you
Promoting data
Catalogue metadata record – harvestable into
meta catalogues
Use DOIs so that data can be cited and citations
found easily on the web
Promotion through news, social media and
events
Be creative: co-design and experiment
Advocacy for data citation!
Data papers
Key skills that enhance research training
Policy landscape and data sharing
Writing and implementing a data management plan
Documenting and contextualising data
Formatting and organising data
Storing and transferring data, encryption and security
Legal, ethical issues in handling and sharing data –
consent, anonymisation and access control
Rights relating to research data
Publishing and citing research data
Data Papers – growing trend
• Encourage data owners to publish a data paper
• Enables formal citation of data - gives credit to data
managers and scientists
• Data housed in a trusted repository with own DOI
• e.g. Nature Scientific Data (http://www.nature.com/scientificdata/)
What about ‘big data’?
• Scientific repositories already do this, e.g. astronomy
• Steps for smaller more traditional archives to ‘scale up’
• While principles are the same – it’s only data! – certain
issues are challenging
Unconsented for research
Often commercial rights
Unknown provenance - hard to verify and document
Often no version control
Web-based sources may ‘disappear’ or access
refused
Data might change dramatically
More about how we are building
capacity for big data at UKDS later!
Effecting a DMP – the researcher
• Discuss data archiving and sharing with research
participants to gain their consent for data sharing
• Anonymise data where needed
• Document and contextualise data for future reuse:
– information embedded in data files, e.g. variable labels,
value labels, codes and descriptions
– final report may contain the majority of contextual and
methodological documentation for data
– publications, working papers, lab books, code books
• Recommended formats for preservation and sharing
• Quality control checks
• Copyright permissions for data ownership
Take away planning issues
Know the legal, ethical and other obligations
towards research participants, colleagues, research
funders and institutions
Know the institution’s policies and services: storage
and backup strategy, research integrity framework,
rights policies, institutional data repository
Assign roles and responsibilities to relevant parties
Incorporate data management into research lifecycle
Implement and review management of data during
project meetings and review
Think ahead, plan for the future
Effecting a DMP – The Institution
Minimum
• Advice to the researcher on funder requirements
and costing for planning & sharing
• Check research applications against a DMP
• Provide secure data storage during research
If you have an institutional repository, provide:
• Visibility - metadata record
• Long-term data preservation
• Data dissemination and access
If not, help refer researchers to a trusted Data Centre
Your exercise
Handout: Exercise: Data Management Planning
• You are to help a researcher create a DMP for his project
• In pairs, identify data management aspects that matter for
this research proposal and that should be included in the
DMP. Consider topics:
• File formats and organisation
• Data documentation, standards & metadata
• Quality assurance
• Data storage, security and backup
• Data access and (re)use, incl. restrictions and challenges
• IPR and data ownership
• Roles & responsibilities
Contact
Louise Corti
Collections Development and Producer Relations team
UK Data Service
University of Essex
CO4 3SQ, UK
Resourcing the data lifecycle:
services, skills & infrastructure
Louise Corti
Director, Collections Development and Producer
Relations
UK Data Service, University of Essex
Workshop on Research Data Management
Conference ‘Changing Landscape of Science & Technology
Libraries’
Indian Institute of Technology Gandhinagar
2-4 March 2017
Covering
• Mission statements and scope
• Model data services and staffing
• Question for your own data service/archive
• Your own assessment of readiness
How to set up a data service
• Mission statement & statement of aspiration
– high-level scientific strategy and competencies
– country and discipline-specific researcher practices in
data sharing and re-use
– relevant legal frameworks
• Policy dialogue, appetite for scope
• Locating and building capacities
• Access to shared knowledge bases is useful
UK Data Archive mission
Promoting best practice in data curation
Raise standards in data management
Raise standards in data security
Drive archival innovation
Advance professionalisation of data service
infrastructures (leadership within the
profession)
Attracting, developing and maintaining
excellent staff
And we undertake R&D
• Core data services
– Curation, management, dissemination
• R&D projects
– Controlled vocabularies
– Infrastructure and tools development, e.g.
self-deposit system, online data browsing
– Safe settings services
– Data sharing practices
– Scaling up for big data
Defining scope of your collections
• Anticipate capacity – space and humans
• Draft a Collections Development Policy – an
evolving document
• Draft an Appraisal and Selection Policy
• Define publishing pathways
• Set up a Data Appraisal Group
• Is your repository FAIR?
FAIR principles for repositories
Findable
Accessible
Interoperable
Re-usable
Persistent identification of collections
https://www.force11.org/group/fairgroup/fairprinciples
OAIS adapted for a data service (ISO 14721)
Pre-Ingest
Access (Data)
(Support)
Staffing at UKDA
80+ staff – mostly supported by ESRC
• UK Data Service
• newer Administrative Research Data Network; Big Data
Network
5+2 main sections:
• Resources and Management Services
• Collections Development and Producer Support
• Ingest and Access Services
• Technical Services
• Preservation Services
• Administrative Research Data & Big Data Teams
Multiple skills needed across Service
• Archiving and librarianship
• Data handling/manipulation
• Research expertise - user & producer community
• Metadata - cataloguing standards, controlled
vocabularies
• IT systems - data management and storage
• Programming - maintenance and development
• Legal, ethical, security, rights expertise
• Finance, HR, management
• Digital preservation
Professionalisation
• Individuals
– networks of organisations (IASSIST, RDA, CODATA)
– continuous development and training
– professional qualifications and formal training
• Organisations
– standards adherence
– governance
– audit, assessment, certification
– sit on decision-making forums across academia, government, funders and other data producers
Key questions for you
• What types of data are envisaged to be part
of your archive? Numeric or qualitative data?
Outputs and code?
• Who are the expected data producers
supplying data? Are they willing to provide
data in appropriate formats and with relevant
documentation?
Key questions for you
• Who are the expected end-users of the
data?
• Is there an expectation to provide analysis
tools for users?
• Will data storage need to be carried out at
multiple locations?
• What are the long-term expectations for
preservation of data?
Key questions for you
• Are there legislative requirements for data
protection/freedom of information to take into
account?
• Are there ethical issues about data sharing?
• Are all data likely to be anonymous?
• Are there specific information security
requirements?
• Will there be tiered access methods? (Open,
Safeguarded, Controlled?)
Capacity to run a data service
• How much capacity do you think you need?
• How much capacity do you have?
• Be realistic - choose RDM and data publishing
activities that are manageable - delegate what
you can to others
• Competition is not helpful here – carve your
niche and federate! Strength in numbers
Aspirations
• Secure funding!
• FAIR, Data Seal of Approval, World Data System
• Lots of beautiful data
• Many happy users
• And smiling bosses!
Final exercise
• Handout: Exercise: Resourcing your institution
for data management and curation
• Write an aspirational statement for your data
service
• Rate your institution’s data sharing readiness
Tools to understand data practices
• Data Asset Framework
• identify which data assets exist in organisation, condition
and format, responsibility and long-term custody
• Digital Repository Audit Method Based on Risk Assessment
(DRAMBORA)
• evaluate and manage risks that threaten data or the
infrastructures they may rely upon
• CARDIO
• assess capabilities to support research data management,
and contribute to an institution-wide agenda for change
• Trusted Repositories Audit & Certification (TRAC)
• audit, assess and certify digital repositories
Guidance materials
UK Digital Curation Centre Guides
• How to Develop RDM Services - a guide for HEIs
• 5 Steps to Research Data Readiness: IT managers
• 5 steps to Developing a Research Data Policy
• (Longer version)