Open science and data sharing: the DataFirst experience / Martin Wittenberg
Open Science
Overview
• Introduction
• Data and the research ecosystem
• The problem of measurement in the social sciences
• Difficulties with sharing data
• Why sharing data is essential
• The role of a data platform like DataFirst
Introduction
• I’m an economist trying to understand what has happened to South Africa since the end of apartheid
– Particularly in relation to wages, employment, inequality, service delivery
• Data and data quality are key
• I also direct DataFirst, which is an organisation based at UCT dedicated to making it easier for researchers to access social science microdata
• www.datafirst.uct.ac.za
• https://sites.google.com/site/martinwwittenberg/home
Data and the research ecosystem
• Data doesn’t just appear
• The value and meaning of data arise from how it emerges within the research ecosystem
Data and the research ecosystem
• Theory
– e.g. how markets work
• Application
– e.g. the impact of imposing a minimum wage in 2018
• Measurement
– e.g. Quarterly Labour Force Survey
– e.g. tax returns
Measurement
• Sometimes for research purposes
• But also incidental to other purposes
– e.g. tax data, satellite “night light” data
• Understand context, rules and procedures used
– Sampling theory
– Measurement instrument (e.g. questionnaire)
– Fieldwork practice
– Post-fieldwork data capture & processing
– Imputations for missing values
Measurement in the social sciences
• Crucial to also understand what you are not seeing
– Non-response
• In the social sciences the subjects of research often have an interest in the outcome
– Choose what to report
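To make the non-response point concrete, here is a minimal Python sketch. All incomes and response probabilities are invented for illustration; they are assumptions, not estimates from any survey. Under those assumptions it shows how the expected mean among respondents falls below the population mean when higher earners are less likely to respond.

```python
# Hypothetical illustration: every figure below is made up, not taken from any dataset.

# Each entry: (monthly income, assumed probability of responding to the survey)
population = [
    (2_000, 0.9),
    (4_000, 0.9),
    (8_000, 0.8),
    (20_000, 0.5),
    (60_000, 0.2),
]


def true_mean(pop):
    """Mean income if every unit in the population were observed."""
    return sum(income for income, _ in pop) / len(pop)


def expected_respondent_mean(pop):
    """Expected mean among respondents: incomes weighted by response probability."""
    total_weight = sum(p for _, p in pop)
    return sum(income * p for income, p in pop) / total_weight


if __name__ == "__main__":
    print("Population mean:          ", true_mean(population))
    print("Expected respondent mean: ", round(expected_respondent_mean(population), 2))
```

The gap between the two numbers is the part of the picture that the survey alone cannot show, which is why knowing the pattern of non-response matters.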
An example from my research
Compare earnings in tax data and surveys
• Wages of employees
Blog post at http://www.econ3x3.org/
Measurement issues
The picture when looking at earnings from self-employment (business profits)
Why?
• Penalties for not reporting
• But accurate reporting means paying more tax
Data within the research ecosystem
• In summary, data is not useful for research unless
– We know where it has come from
– What sort of errors/biases are likely to be involved in the measurement process
• AND
– People who are working on applied questions know that it exists/can be accessed
Difficulties with sharing data
• One of the challenges of sharing data is to provide enough information (metadata) about
– Context
– Measurement process
• Plus the data must be stored in a way that it is “discoverable”
• All of this costs time and effort
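As a rough illustration of what "metadata" and "discoverable" can mean in practice, the following Python sketch defines a very small metadata record and a keyword search over a catalogue. The class `DatasetRecord`, its field names, the `search` helper and the two catalogue entries are all hypothetical; they are not DataFirst's schema or any real standard.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record; field names are illustrative only."""
    title: str
    producer: str
    collection_method: str  # context: how the data came about
    known_limitations: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)


def search(catalogue, term):
    """Return records whose title or keywords mention the term (case-insensitive)."""
    term = term.lower()
    return [
        rec for rec in catalogue
        if term in rec.title.lower() or any(term in k.lower() for k in rec.keywords)
    ]


# Invented catalogue entries, for illustration only.
catalogue = [
    DatasetRecord(
        title="Example household labour survey",
        producer="Example statistics agency",
        collection_method="Sample survey, face-to-face questionnaire",
        known_limitations=["Non-response among high earners"],
        keywords=["wages", "employment"],
    ),
    DatasetRecord(
        title="Example administrative tax records",
        producer="Example revenue authority",
        collection_method="Administrative returns filed by taxpayers",
        known_limitations=["Self-employment income may be under-reported"],
        keywords=["earnings", "tax"],
    ),
]

hits = search(catalogue, "wages")
for rec in hits:
    print(rec.title, "|", rec.collection_method, "|", rec.known_limitations)
```

Even a record this small takes effort to write and maintain for every dataset, which is the cost the slide refers to.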
Other difficulties
• Fear of getting scooped with one’s own data
• Fear of someone else finding a path-breaking application of the data that one hadn’t thought of
• Fear of problems/errors in the measurement process being exposed
• Confidentiality/privacy of respondents
– Ethics clearance
How might one deal with these?
• Getting scooped
– Delay public release
• “Important Science” vs “Mere data gathering”
– Underlying issue is really one of skill
– Response is often “data squatting”/rent extraction
– A more creative response is to find ways to get training programmes up around the data
Issues with sharing, cont.
• Exposing problems with the measurement process
– Becomes more critical if these data are the only ones available
– Reality is that there is no 100% clean dataset
– Provided that there is still a detectable “signal” in the data, it can still be used for science
• It becomes easier to “fix” the problems if they are openly acknowledged
Issues with sharing, cont.
• Confidentiality
– “Open science” doesn’t mean that the data has to be available on the web for anyone
– Key issue is that there have to be transparent protocols for access
– e.g. “Secure Labs” as recently established in DataFirst
Why sharing is essential
• Proper science
– Can only be done if results can be replicated
– Errors in analysis/measurement exposed
• New insights
– It is impossible for one team to be on top of all the ways in which a dataset could be used
– Making data available allows some of the best and brightest people in the world to think about your issues/problems
• e.g. many of our insights into the impact and effectiveness of South Africa’s old age pension system came from American academics
– Of course some garbage is likely to be generated in the process too
Why sharing is essential, cont.
• Improvement in skills
– South African quantitative social scientists of my generation learned most of what we know from seeing international economists (notably Nobel prize winner Angus Deaton) work on our data
• He showed that there are fascinating questions to be answered
• He made his code available
How do we make sharing more successful?
• This is really a question not only about the incentives to researchers and research organisations
• But also about institutions that can facilitate this process
• Organisations like DataFirst play an important role here
The issue is really how to strengthen the links
• Theory
– e.g. how markets work
• Application
– e.g. the impact of imposing a minimum wage in 2018
• Measurement
– e.g. Quarterly Labour Force Survey
– e.g. tax returns
Replicability of results
[Diagram. Elements: Data, Analysis, Published Paper, Review/Replication, Follow-up. Actors: Skilled Researcher, Reader.]
Best practice data production
[Diagram. Elements: Data Producer, Methodological Research, “Best practice”, Practical Issues, Feedback.]
How can we strengthen these loops?
• These are not “add-ons”; they are an integral part of a successful science infrastructure
– Like libraries, research clouds etc.
– Need to be supported:
• Financially
• Mandates for sharing data, particularly if public funds have been used in collecting them