willmers&king open con2016-ct-14.11.16
TRANSCRIPT
Challenges in Preparing and Sharing Open Data
OpenCon 2016 Cape Town
14 December 2016
Michelle Willmers and Thomas King
ROER4D Curation and Dissemination Manager
CC BY
Research on Open Educational Resources (OER) for Development
• Imperative to establish empirical baseline research on OER in Global South
• 86 researchers in 26 countries across 3 continents
• Project ‘Open’ ethos manifests in Open Research strategy, bridging ‘Open’ silos
• Open content (typically used in a teaching and learning context) that can be reused, revised, remixed, redistributed and retained
• Made possible by open licensing, although increasing focus on differentiating implicit vs. explicit open content
• Focus on role OER can play in improving access to quality education
• Focus on role project can play in building Global South Open Education research capacity
• Strong advocacy and activism component (NGO, CBO sectors – not only career researchers)
Focus on empirical baseline manifests in focus on curatorial and publishing capacity within the research project. The project acts as publisher, providing greater agency and control (but presenting some challenges in terms of accreditation/reward).
ROER4D Curation & Dissemination Strategy
• Provide a content management and publishing service to SP researchers and the Network Hub team in order to advance research capacity development efforts and increase visibility of outputs.
• Support Principal Investigators and SP researchers in editorial development of ROER4D outputs.
• Address infrastructure deficits and provide content management solutions (including content hosting) in a research community with uneven institutional support and capacity challenges.
• Ensure that the ROER4D legacy is freely accessible for reuse in line with international curatorial and publishing standards.
• Complement Network Hub Communications efforts in an integrated communications/dissemination approach.
ROER4D data management principles
• Data sharing as component of open content focus.
• Organising and profiling open content increases the potential for reuse and citation (impact).
• Well-organised, strategic research management and content organisation promotes rigour in the research process.
• Copyright vests with the author > data-sharing activity determined by their willingness and capacity to engage.
• Format and platform/tool agnostic.
• Share openly by default on condition that it is valuable, legal and ethical.
ROER4D project data flow
Diagram labels: Network Hub (Google, Vula); Project archive (external); Zenodo; Internal sharing and collaboration; External sharing and collaboration
Five pillars of ROER4D data publication approach
Step 1: Evaluate contractual framework, articulate strategy
Step 2: Get researchers on board
• Check ethics approval and consent
• Ensure first-tier de-identification takes place prior to Network Hub transfer in order to ensure research subject confidentiality
• ROER4D agnostic in its approach (in terms of scale, format and technical sophistication)
• Challenges of varying researcher sophistication in terms of data collection and presentation
• Challenges of varying researcher sophistication in terms of technology employed to capture, present, and analyse data
Step 3: Obtain source sub-project micro-data
• Archive in Vula and UCT e-Research Centre secure institutional archive
• Network Hub C&D team audits researchers’ submitted dataset
> What is the dataset comprised of?
> Are all the pieces there?
> What were the data collection processes, and do we have all the instruments to share?
> What languages are represented?
> Does something else like it exist?
> Who might it be of use to?
• Address file naming and format issues
• Articulate sub-project-specific data management plan
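Part of the file naming check in the audit above could be automated. The following is a minimal sketch in Python, assuming a hypothetical naming convention of the form `sp<NN>_<datatype>_<language>.<ext>` (for example `sp04_interview_en.txt`). The pattern, the accepted extensions and the function names are illustrative assumptions, not ROER4D's actual convention.

```python
import re

# Hypothetical convention (an assumption for this sketch):
# sp<NN>_<datatype>_<two-letter language code>.<ext>
NAME_PATTERN = re.compile(r"^sp\d{2}_[a-z]+_[a-z]{2}\.(csv|txt|pdf)$")


def audit_filenames(filenames):
    """Split file names into those that follow the convention and those that do not."""
    ok, problems = [], []
    for name in filenames:
        (ok if NAME_PATTERN.match(name) else problems).append(name)
    return ok, problems


def languages_represented(filenames):
    """Collect the language codes found in conforming file names."""
    ok, _ = audit_filenames(filenames)
    return sorted({name.rsplit(".", 1)[0].split("_")[2] for name in ok})


# Made-up example submission
submitted = ["sp04_interview_en.txt", "sp04_survey_pt.csv", "Final data (2).xlsx"]
ok, problems = audit_filenames(submitted)
print(problems)                          # names that need attention
print(languages_represented(submitted))  # language codes in the dataset
```

A check like this only covers naming and language coverage; whether all the pieces are there and who the dataset might be useful to still needs a human audit.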
Step 4: Network Hub curation and quality assurance
• Scope and conceptualise the dataset
> Which components of the project-generated micro-data are you ethically and legally allowed to share?
> Which components of the project-generated micro-data will you invest resources in curating and sharing?
> Which instruments will you include?
• Identify focus of data and points of sensitivity
• Define appropriate second-tier de-identification approach
Step 5: Prepare data for publication
• Generate metadata and dataset description (accompanying narrative)
• Submit content to publisher (DataFirst)
• Link to published outputs
• Include description of process in research Methodology statements
• Profile in project communications activity
Step 6: Publish
Some lessons learned
1. Openness increases rigour. Preparing data for publication promotes professional approach to research process.
2. Preparing data for publication exposes weaknesses in instrument design and research process.
3. Introducing C&D and data-sharing focus midway through a project poses many challenges, particularly in terms of ethical and consent components.
4. Data sharing drives focus on reproducibility, transforming traditional approach to crafting methodology statements.
5. The data preparation process takes time (approx. one week of researchers’ time in ROER4D context).
6. Obtaining balance between utility and adequate protection in de-identification of qualitative data is a challenge.
7. Openness is threatening to researchers in terms of exposing weakness in processes and perceived threat of losing publication advantage.
8. C&D and data sharing activity require support, capacity development and resourcing.
Qualitative de-identification
Thomas King
Terms and definitions
• De-identification – removing, eliding or replacing pieces of information that reveal research participants’ (possibly also referents’) identity.
• Anonymity – personal details are not gathered.
• Confidentiality – personal details are not shared.
• E.g. an anonymous survey contains no questions about personal identifiers. A confidential survey does contain these questions, but will not share/publish them.
The two pillars of open data sharing
• Consensual, ethical, legal
• Comprehensible, coherent, valuable
Research Data Management & Open Data sharing
The de-identification balancing act
First, do no harm
Remove as much as needed to ensure the confidentiality or anonymity of the research participants.
Ensure that all ethical and consent processes have been adhered to.
Don’t go overboard
Remove as little as is ethical to ensure the richness of the data.
Take the unit of analysis as the guide – de-identify up to the Unit of Analysis.
E.g. if Study X compares two universities, you can safely remove all identifiers lower than the university affiliation.
HOWEVER
Your data may be useful to others.
The purpose of de-identification is to preserve confidentiality – don’t de-identify for the sake of it.
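The "de-identify up to the unit of analysis" guideline can be illustrated with a small sketch in Python. It assumes tabular records where the university is the unit of analysis; the column names and sample values are hypothetical.

```python
# Hypothetical column names; "university" is the unit of analysis in this example,
# so identifiers below that level are dropped and the affiliation is kept.
IDENTIFIERS_BELOW_UNIT = {"name", "department", "staff_id"}


def deidentify_to_unit(records, identifiers=IDENTIFIERS_BELOW_UNIT):
    """Return copies of the records without identifiers below the unit of analysis."""
    return [
        {key: value for key, value in record.items() if key not in identifiers}
        for record in records
    ]


# Made-up example records
records = [
    {"name": "A. Person", "department": "Physics", "staff_id": "1234",
     "university": "University A", "uses_oer": "yes"},
    {"name": "B. Person", "department": "History", "staff_id": "5678",
     "university": "University B", "uses_oer": "no"},
]

shared = deidentify_to_unit(records)
print(shared[0])
```

The original records are left untouched, so the raw data stays with the researcher while only the reduced copy is shared. Keeping a field such as `department` in the shared copy would be the choice to make if other researchers are likely to need it and consent allows it.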
Qualitative de-identification
• De-identification located in the same ecosystem as data cleaning and data validation – no clear line between data improvement and de-identification
– Cleaning up typos
– Standardising presentation and layout
– Identifying unanswered questions (or additional questions), mislabelled responses, etc.
• Many of these also apply to quantitative data
• Articulation of principles in RDM and description of these processes included in metadata
ROER4D data interrogation process
Diagram (five numbered stages): Read data; Coherence (format & layout); Editing (fix typos & identify anomalous data); De-identifying (remove identifiers); Validation (identify and account for missing data)
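The validation stage of the data interrogation process (identifying and accounting for missing data) lends itself to a simple automated report. The sketch below is in Python and assumes survey responses keyed by respondent ID; the question codes and the values treated as missing are hypothetical.

```python
# Hypothetical survey: the question codes and the values counted as "missing"
# are assumptions for this sketch.
EXPECTED_QUESTIONS = ["q1", "q2", "q3"]
MISSING_MARKERS = {"", "n/a"}


def is_missing(value):
    """Treat absent answers, blanks and 'n/a' as missing."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def missing_answers(responses, expected=EXPECTED_QUESTIONS):
    """Map each respondent ID to the expected questions they left unanswered."""
    report = {}
    for respondent_id, answers in responses.items():
        gaps = [q for q in expected if is_missing(answers.get(q))]
        if gaps:
            report[respondent_id] = gaps
    return report


# Made-up example responses
responses = {
    "r01": {"q1": "yes", "q2": "", "q3": "no"},
    "r02": {"q1": "no", "q2": "yes"},
    "r03": {"q1": "yes", "q2": "no", "q3": "yes"},
}
report = missing_answers(responses)
print(report)
```

A report like this identifies the gaps; accounting for them (skipped question, non-applicable item, capture error) remains a judgement recorded in the dataset description.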
ROER4D project structure
NETWORK HUB
• Principal Investigator
• Curation and Dissemination team
• Communication and Evaluation consultants
SUB PROJECTS
Using largely mixed-methods data (both quantitative and qualitative)
ROER4D de-identification process
1. First-level de-identification by researcher
– Removal of direct identifiers (names of people/institutions/companies, ID numbers, etc.)
– Important to ensure that raw data is not shared
2. Second-level de-identification by C&D team to catch remaining direct identifiers
3. In-depth sweep of the text to identify indirect identifiers
– Meticulous, thorough, repeated reading of the text
• (which ties back to general data enhancement)
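The first two levels of this process (removing direct identifiers, then catching any that remain) can be partly supported by tooling. The following Python sketch is an illustration under stated assumptions, not the ROER4D team's actual tooling: the names, the placeholder labels and the ID number format (any run of eight or more digits) are all hypothetical.

```python
import re

# Hypothetical direct identifiers supplied by the researcher, mapped to neutral
# placeholders. A real project would build this list per dataset.
REPLACEMENTS = {
    "Jane Doe": "[PARTICIPANT 1]",
    "Example University": "[UNIVERSITY A]",
}
# Assumed ID format for this sketch: any run of 8 or more digits.
ID_NUMBER = re.compile(r"\b\d{8,}\b")


def replace_direct_identifiers(text):
    """Replace known names and ID-like numbers with placeholders."""
    for identifier, placeholder in REPLACEMENTS.items():
        text = re.sub(re.escape(identifier), placeholder, text, flags=re.IGNORECASE)
    return ID_NUMBER.sub("[ID NUMBER]", text)


def flag_for_review(text):
    """List capitalised words that survive replacement, as prompts for a human reader."""
    without_placeholders = re.sub(r"\[[^\]]*\]", " ", text)
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", without_placeholders)))


# Made-up example sentence
transcript = "Jane Doe (ID 8001015009087) joined Example University after leaving Durban."
cleaned = replace_direct_identifiers(transcript)
print(cleaned)
print(flag_for_review(cleaned))
```

The flagged words are only candidates. Indirect identifiers depend on context, so the repeated close reading described in point 3 cannot be replaced by pattern matching.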
Tricky situations
• Data collected in multiple languages
– De-identification (particularly in qualitative data) far more difficult – greater reliance on the researcher
• Post-hoc consent process
– Departments merge or close, participants retire or disappear
• Data collected by multiple researchers
– Different collection strategies, adherence to interview schedules, use/non-use of clarifying questions, etc.
Open by design
• Help researchers write consent forms! Particularly for open data sharing.
• ‘Red flag’ clauses abound in template consent forms, including:
– “will be used for research purposes only”
– “data will be destroyed after use”
– “only researchers will have access to the data”
• More open consent forms allow for data sharing but do not mandate it.
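A draft consent form can be screened for the red-flag clauses listed above before it goes to participants. This is a minimal sketch in Python; the clause list comes from the slide, while the function name and the sample template text are hypothetical.

```python
# Clauses quoted on the slide as restricting later data sharing.
RED_FLAG_CLAUSES = [
    "will be used for research purposes only",
    "data will be destroyed after use",
    "only researchers will have access to the data",
]


def find_red_flags(consent_text):
    """Return the listed red-flag clauses that appear in a consent form."""
    # Lower-case and collapse whitespace so line breaks do not hide a clause.
    normalised = " ".join(consent_text.lower().split())
    return [clause for clause in RED_FLAG_CLAUSES if clause in normalised]


# Made-up template text
template = """Your responses will be used for research purposes only.
All data will be destroyed
after use."""
flags = find_red_flags(template)
print(flags)
```

Exact phrase matching will miss reworded variants of these clauses, so a match is a prompt to revise the form and an empty result is not a guarantee that the form permits sharing.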