willmers&king open con2016-ct-14.11.16

Challenges in Preparing and Sharing Open Data OpenCon 2016 Cape Town 14 December 2016 Michelle Willmers and Thomas King ROER4D Curation and Dissemination Manager CC BY

Upload: michelle-willmers

Post on 16-Jan-2017


TRANSCRIPT

Page 1: Willmers&King open con2016-ct-14.11.16

Challenges in Preparing and Sharing Open Data

OpenCon 2016 Cape Town

14 December 2016

Michelle Willmers and Thomas King

ROER4D Curation and Dissemination Manager

CC BY

Page 2: Willmers&King open con2016-ct-14.11.16

Research on Open Educational Resources (OER) for Development

• Imperative to establish empirical baseline research on OER in Global South

• 86 researchers in 26 countries across 3 continents

• Project ‘Open’ ethos manifests in Open Research strategy, bridging ‘Open’ silos

• Open content (typically used in a teaching and learning context) that can be reused, revised, remixed, redistributed and retained

• Made possible by open licensing, although increasing focus on differentiating implicit vs. explicit open content

• Focus on role OER can play in improving access to quality education

• Focus on role project can play in building Global South Open Education research capacity

• Strong advocacy and activism component (NGO, CBO sectors – not only career researchers)

Focus on empirical baseline manifests in focus on curatorial and publishing capacity within the research project. The project acts as publisher, providing greater agency and control (but presenting some challenges in terms of accreditation/reward).

Page 3: Willmers&King open con2016-ct-14.11.16

ROER4D Curation & Dissemination Strategy

• Provide a content management and publishing service to SP researchers and the Network Hub team in order to advance research capacity development efforts and increase visibility of outputs.

• Support Principal Investigators and SP researchers in editorial development of ROER4D outputs.

• Address infrastructure deficits and provide content management solutions (including content hosting) in a research community with uneven institutional support and capacity challenges.

• Ensure that the ROER4D legacy is freely accessible for reuse in line with international curatorial and publishing standards.

• Complement Network Hub Communications efforts in an integrated communications/dissemination approach.

Page 4: Willmers&King open con2016-ct-14.11.16

ROER4D data management principles

• Data sharing as component of open content focus.

• Organising and profiling open content increases the potential for reuse and citation (impact).

• Well-organised, strategic research management and content organisation promotes rigour in the research process.

• Copyright vests with the author > data-sharing activity determined by their willingness and capacity to engage.

• Format and platform/tool agnostic.

• Share openly by default on condition that it is valuable, legal and ethical.

Page 5: Willmers&King open con2016-ct-14.11.16

ROER4D project data flow

[Diagram. Recoverable labels: Network Hub (Google, Vula); Project archive (external); Zenodo; Internal sharing and collaboration; External sharing and collaboration]

Page 6: Willmers&King open con2016-ct-14.11.16

Five pillars of ROER4D data publication approach

Page 7: Willmers&King open con2016-ct-14.11.16

Step 1: Evaluate contractual framework, articulate strategy

Page 8: Willmers&King open con2016-ct-14.11.16

Step 2: Get researchers on board

Page 9: Willmers&King open con2016-ct-14.11.16

Step 3: Obtain source sub-project micro-data

• Check ethics approval and consent

• Ensure first-tier de-identification takes place prior to Network Hub transfer in order to ensure research subject confidentiality

• ROER4D agnostic in its approach (in terms of scale, format and technical sophistication)

• Challenges of varying researcher sophistication in terms of data collection and presentation

• Challenges of varying researcher sophistication in terms of technology employed to capture, present, and analyse data

Page 10: Willmers&King open con2016-ct-14.11.16

Step 4: Network Hub curation and quality assurance

• Archive in Vula and UCT e-Research Centre secure institutional archive

• Network Hub C&D team audits researchers’ submitted dataset

> What is the dataset comprised of?

> Are all the pieces there?

> What were the data collection processes, and do we have all the instruments to share?

> What languages are represented?

> Does something else like it exist?

> Who might it be of use to?

• Address file naming and format issues

• Articulate sub-project-specific data management plan
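The "file naming and format issues" check in Step 4 could be partly automated. The sketch below is only an illustration under stated assumptions: the naming rule (lowercase letters, digits, hyphens, underscores) and the list of acceptable formats are hypothetical, not ROER4D's actual conventions, and the file names are invented examples.

```python
import re
from pathlib import PurePath

# Hypothetical convention: lowercase letters, digits, hyphens and underscores only.
SAFE_NAME = re.compile(r"^[a-z0-9][a-z0-9_-]*$")
# Hypothetical list of formats treated as suitable for sharing.
OPEN_FORMATS = {".csv", ".txt", ".pdf", ".json"}


def audit_filenames(filenames):
    """Return a dict mapping each problematic file name to a list of issues."""
    report = {}
    for name in filenames:
        path = PurePath(name)
        issues = []
        if not SAFE_NAME.match(path.stem):
            issues.append("name")    # breaks the assumed naming rule
        if path.suffix.lower() not in OPEN_FORMATS:
            issues.append("format")  # not in the assumed list of formats
        if issues:
            report[name] = issues
    return report


# Invented example of a submitted dataset listing.
submitted = ["interview_01.txt", "Survey Results FINAL.xlsx", "focus-group-2.docx"]
report = audit_filenames(submitted)
```

A report like this only flags candidates for attention; deciding what to rename or convert remains part of the sub-project-specific data management plan.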

Page 11: Willmers&King open con2016-ct-14.11.16

Step 5: Prepare data for publication

• Scope and conceptualise the dataset

> Which components of the project-generated micro-data are you ethically and legally allowed to share?

> Which components of the project-generated micro-data will you invest resources in curating and sharing?

> Which instruments will you include?

• Identify focus of data and points of sensitivity

• Define appropriate second-tier de-identification approach

Page 12: Willmers&King open con2016-ct-14.11.16

Step 6: Publish

• Generate metadata and dataset description (accompanying narrative)

• Submit content to publisher (DataFirst)

• Link to published outputs

• Include description of process in research Methodology statements

• Profile in project communications activity
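As a rough illustration of the "generate metadata and dataset description" task in Step 6, the sketch below assembles a minimal record and refuses to proceed if a required field is empty. The field names are hypothetical and do not represent the schema of DataFirst or any other repository; all values are placeholders.

```python
import json

# Hypothetical minimal field set; not the schema of DataFirst or any repository.
REQUIRED_FIELDS = ("title", "creators", "description", "languages", "licence")


def build_metadata(**fields):
    """Return a metadata record, raising ValueError if a required field is empty."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("Missing metadata fields: " + ", ".join(missing))
    return dict(fields)


# Placeholder values for illustration only.
record = build_metadata(
    title="Example sub-project dataset",
    creators=["Researcher A", "Researcher B"],
    description="Accompanying narrative describing collection and de-identification.",
    languages=["English"],
    licence="CC BY",
    related_outputs=[],  # links to published outputs would go here
)
metadata_json = json.dumps(record, indent=2, sort_keys=True)
```

Serialising to JSON is just one convenient choice here; a publisher may require a different format.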

Page 13: Willmers&King open con2016-ct-14.11.16

Some lessons learned

Page 14: Willmers&King open con2016-ct-14.11.16

1. Openness increases rigour. Preparing data for publication promotes professional approach to research process.

2. Preparing data for publication exposes weaknesses in instrument design and research process.

3. Introducing C&D and data-sharing focus midway through a project poses many challenges, particularly in terms of ethical and consent components.

4. Data sharing drives focus on reproducibility, transforming traditional approach to crafting methodology statements.

5. The data preparation process takes time (approx. one week of researchers’ time in ROER4D context).

6. Obtaining balance between utility and adequate protection in de-identification of qualitative data is a challenge.

7. Openness is threatening to researchers in terms of exposing weakness in processes and perceived threat of losing publication advantage.

8. C&D and data sharing activity require support, capacity development and resourcing.

Page 15: Willmers&King open con2016-ct-14.11.16

Qualitative de-identification

Thomas King

Page 16: Willmers&King open con2016-ct-14.11.16

Terms and definitions

• De-identification – removing, eliding or replacing pieces of information that reveal research participants’ (possibly also referents’) identity.

• Anonymity – personal details are not gathered.

• Confidentiality – personal details are not shared.

• E.g. an anonymous survey contains no questions about personal identifiers. A confidential survey does contain these questions, but will not share/publish them.

Page 17: Willmers&King open con2016-ct-14.11.16

The two pillars of open data sharing

• Consensual, ethical, legal

• Comprehensible, coherent, valuable

Research Data Management & Open Data sharing

Page 18: Willmers&King open con2016-ct-14.11.16

The de-identification balancing act

First, do no harm

Remove as much as needed to ensure the confidentiality or anonymity of the research participants.

Ensure that all ethical and consent processes have been adhered to.

Don’t go overboard

Remove as little as is ethical to ensure the richness of the data.

Take the unit of analysis as the guide – de-identify up to the Unit of Analysis.

E.g. if Study X compares two universities, you can safely remove all identifiers lower than the university affiliation.

HOWEVER

Your data may be useful to others. The purpose of de-identification is to preserve confidentiality – don’t de-identify for the sake of it.
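The unit-of-analysis guideline on this slide can be sketched for tabular records. The example assumes, purely for illustration, that the unit of analysis is the university and that the fields "name" and "department" are the identifiers sitting below it; the field names and records are invented.

```python
# Hypothetical fields that identify participants below the unit of analysis
# (here assumed to be the university).
LOWER_LEVEL_IDENTIFIERS = frozenset({"name", "department"})


def deidentify_to_unit(records, drop_fields=LOWER_LEVEL_IDENTIFIERS):
    """Return copies of the records without the lower-level identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in drop_fields}
        for record in records
    ]


# Invented records for illustration.
raw = [
    {"name": "Participant One", "department": "History",
     "university": "University A", "uses_oer": True},
    {"name": "Participant Two", "department": "Physics",
     "university": "University B", "uses_oer": False},
]
shared = deidentify_to_unit(raw)
```

The university field is kept because the comparison depends on it; whether it can stay is still subject to the consent and ethics checks described above.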

Page 19: Willmers&King open con2016-ct-14.11.16

Qualitative de-identification

• De-identification located in the same ecosystem as data cleaning and data validation – no clear line between data improvement and de-identification

– Cleaning up typos

– Standardising presentation and layout

– Identifying unanswered questions (or additional questions), mislabelled responses, etc.

• Many of these also apply to quantitative data

• Articulation of principles in RDM and description of these processes included in metadata
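One of the cleaning tasks named on this slide, identifying unanswered questions, lends itself to a simple check. The sketch below assumes responses are held as a mapping from a respondent code to that respondent's answers; the structure, question labels and data are hypothetical.

```python
def find_unanswered(responses, expected_questions):
    """Map each respondent code to the expected questions that have no answer."""
    gaps = {}
    for respondent, answers in responses.items():
        missing = []
        for question in expected_questions:
            value = answers.get(question)
            # Treat absent keys, None and whitespace-only strings as unanswered.
            if value is None or not str(value).strip():
                missing.append(question)
        if missing:
            gaps[respondent] = missing
    return gaps


# Invented data for illustration.
expected = ["Q1", "Q2", "Q3"]
responses = {
    "R01": {"Q1": "Yes", "Q2": "Weekly", "Q3": "Open textbooks"},
    "R02": {"Q1": "No", "Q2": "  "},
}
gaps = find_unanswered(responses, expected)
```

Recording the resulting gaps in the dataset description is one way to carry this process into the metadata.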

Page 20: Willmers&King open con2016-ct-14.11.16

ROER4D data interrogation process

[Diagram with five numbered stages. Recoverable labels: Read data; Coherence (format & layout); Editing (fix typos & identify anomalous data); De-identifying (remove identifiers); Validation (identify and account for missing data)]

Page 21: Willmers&King open con2016-ct-14.11.16

ROER4D project structure

NETWORK HUB

• Principal Investigator

• Curation and Dissemination team

• Communication and Evaluation consultants

SUB PROJECTS

Using largely mixed-methods data (both quantitative and qualitative)

Page 22: Willmers&King open con2016-ct-14.11.16

ROER4D de-identification process

1. First-level de-identification by researcher

– Removal of direct identifiers (names of people/institutions/companies, ID numbers, etc.)

– Important to ensure that raw data is not shared

2. Second-level de-identification by C&D team to catch remaining direct identifiers

3. In-depth sweep of the text to identify indirect identifiers

– Meticulous, thorough, repeated reading of the text (which ties back to general data enhancement)
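The removal of known direct identifiers (levels 1 and 2 above) can be supported by a simple find-and-replace pass. This is a minimal sketch, assuming the researcher supplies a list of known names mapped to placeholder codes; the names, placeholders and sentence are invented. It does nothing for indirect identifiers, which still need the repeated close reading described in level 3.

```python
import re


def replace_direct_identifiers(text, identifiers):
    """Replace each known identifier (case-insensitive, whole words) with its placeholder."""
    # Longest terms first, so a full name is handled before a shorter part of it.
    for term in sorted(identifiers, key=len, reverse=True):
        placeholder = identifiers[term]
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        text = pattern.sub(lambda _match, p=placeholder: p, text)
    return text


# Invented transcript fragment and identifier list.
known = {
    "Jane Doe": "[PARTICIPANT_1]",
    "Jane": "[PARTICIPANT_1]",
    "Example University": "[INSTITUTION_A]",
}
original = "Dr Jane Doe at Example University said Jane would share the course."
cleaned = replace_direct_identifiers(original, known)
```

Using consistent placeholder codes rather than deleting text keeps the transcript readable and preserves who said what.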

Page 23: Willmers&King open con2016-ct-14.11.16

Tricky situations

• Data collected in multiple languages

– De-identification (particularly in qualitative data) far more difficult – greater reliance on the researcher

• Post-hoc consent process

– Departments merge or close, participants retire or disappear

• Data collected by multiple researchers

– Different collection strategies, adherence to interview schedules, use/non-use of clarifying questions, etc.

Page 24: Willmers&King open con2016-ct-14.11.16

Open by design

• Help researchers write consent forms! Particularly for open data sharing.

• ‘Red flag’ clauses abound in template consent forms, including:

– “will be used for research purposes only”

– “data will be destroyed after use”

– “only researchers will have access to the data”

• More open consent forms allow for data sharing but do not mandate it.
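A draft consent form can be screened for the ‘red flag’ clauses listed on this slide before it goes to participants. The sketch below matches only those three quoted phrases, ignoring case and line breaks; it is an illustrative aid, not a substitute for ethical review, and the sample form text is invented.

```python
# The three clauses quoted on the slide; any extension of this list is an assumption.
RED_FLAG_PHRASES = [
    "will be used for research purposes only",
    "data will be destroyed after use",
    "only researchers will have access to the data",
]


def find_red_flags(consent_text, phrases=RED_FLAG_PHRASES):
    """Return the red-flag phrases that occur in the consent form text."""
    # Collapse whitespace and lowercase so line wrapping and capitals do not hide a match.
    normalised = " ".join(consent_text.lower().split())
    return [phrase for phrase in phrases if phrase in normalised]


# Invented draft consent text.
draft = """Your responses will be used for research
purposes only. Data will be destroyed after use."""
flags = find_red_flags(draft)
```

Exact phrase matching will miss reworded clauses, so a clean result should be read as "none of the known phrases found", not as confirmation that the form permits sharing.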