What have Scientists Planned for Data Sharing and Reuse? A Content Analysis of NSF Awardees’ Data Management Plans Renata Curty, Youngseek Kim & Dr. Jian Qin Baltimore, 4-5 April 2013


Presentation at the Research Data Access and Preservation Summit (RDAP2013) - Baltimore, MD, 4-5 April 2013.


Page 1: What have Scientists Planned for Data Sharing and Reuse? A Content Analysi…

What have Scientists Planned for Data Sharing and Reuse?

A Content Analysis of NSF Awardees’ Data Management Plans

Renata Curty, Youngseek Kim & Dr. Jian Qin

Baltimore, 4-5 April 2013


Motivation

While the NSF mandate gives researchers plenty of flexibility to define their own DMPs, and many academic institutions provide DMP writing support, little is known about how scientists address their strategies in their DMPs.


Study Design

Online Survey: 20 questions

Target Population: NSF Awardees from January 18, 2011 to November 5, 2012 - Standard Grants - Total 16065

Random Sample: 1606 cases

Pilot Study: 100 Awardees (Survey Reformulation)

Final Deployment: 966 awardees, 169 responses (17.5%) and DMPs (68)
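The sampling figures above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers on the slide; reading the 68 DMPs as a share of the 169 responses is an assumption, since the slide does not say how the DMPs relate to the respondents.

```python
# Figures copied from the Study Design slide.
population = 16065     # NSF standard-grant awardees in the target window
random_sample = 1606   # cases drawn at random
deployed = 966         # awardees in the final deployment
responses = 169        # survey responses received
dmps = 68              # DMPs collected (assumed here to come from respondents)


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


sampling_fraction = pct(random_sample, population)
response_rate = pct(responses, deployed)
dmp_share = pct(dmps, responses)

print(f"Sampling fraction: {sampling_fraction}%")
print(f"Response rate: {response_rate}%")
print(f"DMPs per 100 responses: {dmp_share}")
```

The response rate computed this way agrees with the 17.5% reported on the slide.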


Awards Info

NSF Directorate, Amount Awarded

166 166

Directorates: BIO, CISE, EHR, ENG, GEO, MPS, SBE

Chart shares: 10%, 16%, 12%, 18%, 16%, 15%, 13%


Awardees Info

Age, Organization Type

150 151

Age groups: 25-34, 35-44, 45-54, 55-64, 65+

Chart shares: 7%, 41%, 26%, 19%, 7%

Organization type: Academia, 93%


Awardees Info

Position in Academia

143 138

Assistant Professor 22%
Associate Professor 28%
Full Professor 40%
Researcher 6.77%
Others: Dean (3), Professor Emeritus (1), Professor of Practice (1), Lecturer/Instructor (1), Post-Doctoral Fellow (1), Emeritus Senior Scientist, Director, Expert Consultant, Administrative Faculty Position, Chair.

Tenured 62%
Retired 2%
On Tenure Track 25%
Non-Tenure Track 11%


Geographical Distribution

109

Created with Google Fusion Tables.


Response | DMP is difficult to execute | Writing a DMP for NSF proposal is a challenging task | DMP is important to formalize data sharing practices in science
Strongly disagree | 4.79% | 0.40% | 3.01%
Disagree | 22.75% | 21.56% | 10.24%
Somewhat disagree | 11.38% | 13.77% | 6.63%
Neither agree or disagree | 25.75% | 25.75% | 10.84%
Somewhat agree | 23.35% | 23.35% | 22.89%
Agree | 8.98% | 10.18% | 33.13%
Strongly agree | 2.99% | 2.99% | 13.25%
N | 167 | 167 | 166
Mean | 3.79 | 3.89 | 4.93
SD | 1.51 | 1.45 | 1.62
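The summary statistics reported with each statement can be recomputed from the chart percentages. The sketch below does this for "DMP is important to formalize data sharing practices in science" (N=166). It rests on two assumptions that the slide does not state: that the third percentage in each group of three belongs to this statement, and that responses are scored 1 (Strongly disagree) to 7 (Strongly agree).

```python
from math import sqrt

# ASSUMED mapping: chart percentages for the "DMP is important..." statement,
# ordered from Strongly disagree (score 1) to Strongly agree (score 7).
shares = [3.01, 10.24, 6.63, 10.84, 22.89, 33.13, 13.25]
scores = range(1, 8)


def likert_mean_sd(shares, scores):
    """Weighted mean and population standard deviation of Likert scores."""
    total = sum(shares)  # close to 100; dividing by it absorbs rounding error
    mean = sum(p * s for p, s in zip(shares, scores)) / total
    var = sum(p * (s - mean) ** 2 for p, s in zip(shares, scores)) / total
    return mean, sqrt(var)


mean, sd = likert_mean_sd(shares, scores)
print(f"mean = {mean:.2f}, sd = {sd:.2f}")
```

Under these assumptions the result lands close to the 4.93 and 1.62 shown on the slide, which suggests those two figures are the mean and the standard deviation.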


Types of Data

3D Models 13.01% - 19
Audio Files 12.33% - 18
Curriculum Materials 21.23% - 31
Data Models 27.40% - 40
Field Notes 26.03% - 38
Experimental Data 63.70% - 93
Images 36.99% - 54
Interview Transcripts 17.12% - 25
Patient Records 0.68% - 1
Samples 20.55% - 30
Software 35.62% - 52
Spreadsheets 40.41% - 59
Video Files 21.23% - 31
Others: Computational Models, Surveys, DNA Sequences, Computer Codes, Crowdsourcing Data (Reviews)

Documentation of Data

Will follow:
46% - Disciplinary practices
37% - Research project’s needs
17% - Institutional recommendations/guidelines

158
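Each data type above is listed with a percentage and a count, so the number of respondents the percentages are based on can be inferred. This is a rough sketch: the base is not stated next to the list, the search range of 100 to 200 is an arbitrary choice, and the function name is illustrative.

```python
# Counts and percentages as listed for each data type: (count, percent).
data_types = {
    "3D Models": (19, 13.01),
    "Audio Files": (18, 12.33),
    "Curriculum Materials": (31, 21.23),
    "Data Models": (40, 27.40),
    "Field Notes": (38, 26.03),
    "Experimental Data": (93, 63.70),
    "Images": (54, 36.99),
    "Interview Transcripts": (25, 17.12),
    "Patient Records": (1, 0.68),
    "Samples": (30, 20.55),
    "Software": (52, 35.62),
    "Spreadsheets": (59, 40.41),
    "Video Files": (31, 21.23),
}


def matching_bases(items, candidates):
    """Return every candidate N that reproduces all listed percentages."""
    return [
        n
        for n in candidates
        if all(round(100 * count / n, 2) == pct for count, pct in items.values())
    ]


# ASSUMPTION: the true base lies somewhere between 100 and 200 respondents.
bases = matching_bases(data_types, range(100, 201))
print(bases)
```

Because respondents could select several data types, the percentages are not expected to sum to 100.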


Challenges Encountered

None 26%
Lack of guidance from my institution 29%
Lack of guidance from NSF 36%
Appropriate infrastructure to archive/preserve data 41%
Level of granularity of data 25%
Data Description & Documentation 30%
Which stage(s) of research to share the data 25%

Others:

Some projects do not generate data

Conflict between DMP requirement and IRB requirements regarding social and behavioral research data

Conflicts between intellectual property and data protection

Long-term preservation issues

Conflicts between individual/group and institutional strategies

169


Data Access & Availability

167

Others: “Publications”, “Available to NSF only”

Open 45%
Available with some restrictions 51%
Restricted 5%

By email request 45.52% - 61
Personal website 17.91% - 24
Research Group/Project Website 51.49% - 69
Institutional Repository 20.15% - 27
Disciplinary Repository 32.84% - 44


164

Barriers for Data Reuse


Reuse Issues - Privacy, Anonymity & Confidentiality

“IRB restrictions on ability to share even deidentified data. Concern that sharing even deidentified data will discourage participation in the study.”

“For myself, no. But for others to use my data, yes: for qualitative data, under IRB requirements for the protection of human subjects around confidentiality and anonymity, DMPs are nearly impossible to implement without perhaps some kind of temporal restriction on them (like, ‘This archive can only be opened in 20 -30 - 40 years’ or something like that)”

“The project involves human subject; so protections have to be put in place that may limit reuse applications in the future.”

“HIPAA [Health Insurance Portability and Accountability Act] issues - obtaining self reporting data on human subjects.”


Reuse Issues - Context, Time Factor & Documentation

“My past data was collected on a unique system built specifically for the research project. Need lots of context to reuse the data.”

“The only problems I see is that data can be taken out of context in a way that produces results that might not be correct.”

“Data is specific to testing scenarios. The insight gleaned from our experimental data is of more importance than the data itself.”

“My data is for specific purposes and it is hard to conceive of how someone would use it for something else/different. Even with a significant amount of metadata it would be difficult for someone to know all the circumstances under which the data was collected and why it was collected.”

“All scientific data is collected in particular context. Mechanisms that facilitate the description of that context are lacking. The creation of metadata that provides this information is a cumbersome, boring task and there are few resources available to ease the burden.”


Reuse Issues - Format, Tools, Infrastructure Interoperability & Standards

“Systems are always changing...It would be best if we could upload data to NSF so that it will be publicly available in the same way NIST [National Institute of Standards and Technology] publishes data.”

“Our raw data formats are extremely large, and need to be compressed into reduced, on-line archives for sharing. It is not possible for me as an individual PI to archive the raw data for others to examine.”

“My data is generally related to large software artifacts, so using it could involve quite a bit of work to get those artifacts running. This is something that I explicitly try to come up with solutions for in my DMPs.”

“Until NSF provides a free national repository for data archiving, we will not make progress in this area. If such an archive was available, it would be sensible to require researchers to place data there at the end of a grant and would allow other researchers to take advantage of it in a practical way.”


DMPs – Preliminary Content Analysis

• Coding Scheme

Used both deductive and inductive approaches

35 codes

Deductive: NSF DMP Policy and University of Virginia's Guideline

Inductive: Emerged from DMP statements

• Data Analysis Procedure

A total of 766 utterances were identified

642 unique utterances


DMPs’ Content

<Wordle Cloud Generated Based on Numbers of Each Code across the 68 DMPs>


Coding Scheme

Types of Data
• What to Generate
• What Data Types
• How to Create
• Where to Get Existing Data

Metadata Standards
• Data Format
• Metadata Form
• How to Create
• Which Metadata Standard
• Contextual Details Needed
• Discoverability of the Data

Data Access & Sharing Process
• When Available
• How Available
• What Available
• Process for Gaining Access
• How Long Retain the Right
• Embargo Period
• Ethical/Privacy Issues
• Compliance with IRB Protocol
• Whose Intellectual Property

Data Archiving Plan
• Strategy for Archiving Data
• Which Repository
• Procedures for Long-Term Storage
• Data Preservation Period
• What Data Preserved for Long-Term
• Transformation Required
• Data Documentation
• Related Information

Data Reuse Plan
• Reusability of the Data
• Restrictions to Access
• Groups Interested In
• Foreseeable Uses/Users

Others
• Data Lifecycle
• Data Curation
• Budget


Types of Data

Codes | Freq. | Examples
What to Generate | 58 | Geochemical Data, Physical Samples, Mathematica (programming) Code, Course Materials
What Data Types | 37 | Gene Sequences, Experimental Data, Interview Transcript, Video Recordings
How to Create Data | 25 | Experimental Setup, Field Observation, Simulation, Survey, Interviews
Where to Get Existing Data | 13 | Moore Laboratory of Zoology, ArcView/GIS Inventories, Prior Study’s Database

Metadata Standard

Codes | Freq. | Examples
Data Format | 38 | CSV file, TEMPO data file, XML format, SPSS file, plain text
Metadata Form | 31 | ArcGIS Metadata file, XML-based standard file, GIS database file
How to Create Metadata | 14 | Use existing metadata standards, or develop their own metadata standards
Which Metadata Standard | 15 | Dublin Core, DNA Sequence Metadata, EML (Ecological Metadata Language)
Contextual Details Needed | 10 | All aspects of the development project documented, experimental procedure record
Data Discoverability | 7 | Searches Built into Library, Searchable through Project Website


Data Access & Sharing Process

Codes | Freq. | Examples
When Available | 28 | Post-Publication, Post-Project, After Data Collection
How Available | 37 | Upon Request, Project Website, GMOD CHADO databases, Institutional Repository
What Available | 33 | Original research data (genome assemblies), survey data, educational materials
Process for Gaining Access | 25 | Email Request, Material Transfer Agreement, Direct Access from Web or Repository
How Long Retain the Right | 18 | Withhold until Publication, Years after Project Ends, Years after Data Production
Embargo Period | 5 | Years after data collection, Period for commercialization
Ethical/Privacy Issues | 21 | Private information is not available to the public
Compliance with IRB Protocol | 13 | IRB application submission for human subject research
Whose Intellectual Property | 17 | Property of the PI and Co-PIs, Institutions, Open-Access


Data Archiving

Codes | Freq. | Examples
Strategy for Archiving Data | 31 | Hosted on the Web Servers at (university), ICPSR, disciplinary data repository
Which Repository | 55 | Organization website, institutional or discipline data repository
Procedures for Long-Term Storage | 33 | Submitted to databanks including NCBI GEO, Genbank, DataONE, Dryad
Data Preservation Period | 11 | Minimum of five years post-grant funding, Long-term preservation through disciplinary data repositories
What Data Preserved for Long-Term | 7 | All data and materials generated by this award, Genome Sequencing Data
Transformation Required | 4 | Keeping raw image data in its uncompressed form, transferred to IRI format
Data Documentation Submitted | 11 | Contextual details about experimental procedures, all aspects of the development project
Related Information Submitted | 3 | Metadata files, proposed study information, companion web page


Data Reuse Plan

Codes | Freq. | Examples
Reusability of the Data | 6 | Descriptions about reusable methods (Used by a research community to follow-up)
Restrictions to Access | 6 | Access allowed for a certain group of researchers
Groups Interested In | 8 | Wider research community studying the Great Lakes, academic geography organizations, and geography teacher associations
Foreseeable Uses/Users | 10 | Available to engineers, clinicians, and medical researchers, sociologists and psychologists working in relevant sub-fields.

Others

Codes | Freq. | Examples
Data Lifecycle | 1 | Application of the Life Cycle Inventory databases
Data Curation | 4 | Curation (Consortiums and Partnerships)
Budget | 9 | Institution will absorb costs, no incremental costs, marginal costs
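The code frequencies listed in the tables for each category can be totalled to compare how much attention each part of a DMP received. This is a minimal sketch using only the frequencies shown on the slides. The slides do not say how these frequencies relate to the utterance counts reported earlier, so the totals are simply what the listed numbers add up to.

```python
# Code frequencies as listed in the slide tables, grouped by coding category.
frequencies = {
    "Types of Data": [58, 37, 25, 13],
    "Metadata Standard": [38, 31, 14, 15, 10, 7],
    "Data Access & Sharing Process": [28, 37, 33, 25, 18, 5, 21, 13, 17],
    "Data Archiving": [31, 55, 33, 11, 7, 4, 11, 3],
    "Data Reuse Plan": [6, 6, 8, 10],
    "Others": [1, 4, 9],
}

totals = {category: sum(freqs) for category, freqs in frequencies.items()}
grand_total = sum(totals.values())

# Print categories from most to least frequently coded, with their share.
for category, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    share = 100 * total / grand_total
    print(f"{category:32s} {total:4d} {share:5.1f}%")
```

Sorting by total makes it easy to see which categories were coded most and least often across the DMPs.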


Data Available - (bar chart; vertical axis 0 to 30)

Categories: After data collection; After project ends; After publication; Years after data collection; Years after project ends; Years after publication; Not Specified; Not Mentioned

Bar values as extracted: 3, 3, 10, 3, 8, 1, 27, 13
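The bar values from this chart can be paired with the category labels to estimate how many DMPs left the timing of data availability open. The sketch assumes that values and labels appear in the same order in the slide text, which the slide text itself does not confirm.

```python
# Category labels and bar values in the order they appear in the slide text.
# ASSUMPTION: the i-th value belongs to the i-th label.
labels = [
    "After data collection",
    "After project ends",
    "After publication",
    "Years after data collection",
    "Years after project ends",
    "Years after publication",
    "Not Specified",
    "Not Mentioned",
]
values = [3, 3, 10, 3, 8, 1, 27, 13]

counts = dict(zip(labels, values))
total = sum(values)

# DMPs that gave no concrete timing under the assumed pairing.
unspecified = counts["Not Specified"] + counts["Not Mentioned"]
share_unspecified = round(100 * unspecified / total, 1)

print(f"Total DMPs in chart: {total}")
print(f"No concrete timing: {unspecified} ({share_unspecified}%)")
```

The bar values sum to the 68 DMPs collected, which is consistent with each DMP being counted once in the chart.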


Types of Data Repositories for Long-Term Archiving (bar chart; vertical axis 0 to 16)

Categories: Disciplinary Repository; External/Commercial Storage; Institutional Repository; Internal/Institutional Storage; Journal Repository/Supplement; Lab/Organization Website; Not mentioned/Specified

Bar values as extracted: 11, 4, 14, 11, 2, 13, 13


Some insights – DMPs’ Preliminary Analysis

More informal/personal data sharing procedures rather than formal/institutionalized data sharing and management plans

Most DMPs lack content on “Metadata Standard” and “Data Reuse Plan”

Few have plans for long-term archiving; plans and ideas about long-term use of their data are very vague

Many DMPs address data archiving in institutional repositories that do not exist yet but are expected to be created

A few DMPs mention that interview transcripts will be available, but without addressing IRB issues


Future Directions

Survey a larger number of Awardees

More exhaustive coding analysis and in-depth exploration of the DMPs’ content

Analysis of DMPs to identify patterns, common challenges and best practices across and within different disciplinary communities


Thank you!

[email protected]

Let’s Go Orange!