Data Sharing, Small Science, and Institutional Repositories
Melissa H. Cragin & Carole L. Palmer
Center for Informatics Research in Science and Scholarship
Graduate School of Library and Information Science, University of Illinois
Jacob R. Carlson & Michael Witt
Purdue University Libraries
A view from the Institutional Repository
Advancing university-based cyberinfrastructure is dependent on our understanding of how to support data practices and needs.
Sharing is at the heart of success: collecting, storing, and making use of data can only come after the means for sharing are in place.
We cannot collect and curate all data, particularly in a way that facilitates effective re-use. We will need to work with researchers to develop selection and appraisal guidelines, and data services.
Data Curation Profiles Project
Project focus: which data are researchers willing to share, when, and with whom?
Objectives: derive requirements for managing data sets in IRs; develop policies for archiving and access; identify librarian roles & skill sets for supporting data management, sharing & curation.
Biochemistry, Biology
Civil Engineering, Electrical Engineering
Food Sciences, Earth and Atmospheric Sciences
Soil Science
Anthropology, Geology
Plant Sciences, Kinesiology
Speech and Hearing, Earth and Atmospheric Sciences
Soil Science
Methods
Institutional Review Board for approval of Human Subjects Research
increasingly focused, materials-based interviews; Pre-interview Worksheet; Requirements Worksheet
“data set” samples
Data Curation Profiles: http://www.datacurationprofiles.org/
[Figure: bar chart, "Faculty Population for Initial Needs Assessment by Department"]
Bar values, as listed: 43, 37, 24, 17, 16, 14, 13, 12, 10, 10, 8, 7, 7, 7, 7, 7, 6, 6, 5, 5, 5, 5, 4
Departments, as listed: Illinois State Surveys; No. Dept/s with <4 faculty; Natural Res & Env Sci; Civil & Environmental Eng; Veterinary Sciences; Crop Sciences; Plant Biology; Architecture and Landscape Architecture; Agricultural Engineering; Geography; Geology; Agr & Cons Econ; Animal Sciences; Atmospheric Sciences; Food Science & Human Nutrition; Mechanical & Industrial Eng; Animal Biology; Waste Management Research Ctr; Anthropology; Electrical & Computer Eng; Materials Science & Engineering; Urban & Reg Planning; Chemistry
“Faculty of the Environment” Data Needs Project
Collaborators: Bryan Heidorn, Michelle Wander, U of I Environmental Council
Smallish Science
single PI (often); often dependent on graduate students; ad hoc data management systems; idiosyncratic sharing practices; “success” dependent on using one’s own data
But… may be working at community level; may be producing all digital data; may be conducting “data-driven” science; may be producing very large data sets
Data Characteristics: Crystallography and Geology

Type
  Crystallography: 1. “Raw data”: most information rich, long-term value for re-use. … 4. “CIF file” (crystallography exchange): most commonly shared data type.
  Geology: 1. “Reduced spreadsheet”: table with average values for multiple observations; most often requested by others.

Format
  Crystallography: 1. Binary data (image). 4. Crystallographic Information File, text (field-wide standard for numerical data).
  Geology: 1. Excel spreadsheet.

Size
  Crystallography: 1. Each image or “frame” is ¼ to 1 MB; a set is approx. 2,400 frames = approx. 1 GB. 4. > 500 KB.
  Geology: 1. Spreadsheet size under 1 MB.

Intellectual Property/Data Owners
  Crystallography: Service model: provide a service to chemists by solving crystal structures. Ownership of the data is ambiguous and requires negotiation before data “hand-off”.
  Geology: Depends on source of funding: governmental and private grants, gov. institutions, industry. Ownership of and right to the data range from full to very limited, some long-term “embargoes”.

Accessibility
  Crystallography: Field-wide repositories. Many journals require deposit of CIF files. OAI-PMH tools becoming available for CIF files.
  Geology: Difficult and ad hoc. Well-known researchers receive direct requests for data, often based on publications.
Profiling complexities & differences
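The crystallography size figures above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the slide's size units are megabytes and uses 1 GB = 1000 MB, and the function name is hypothetical.

```python
def set_size_range_mb(frames: int, frame_mb_low: float, frame_mb_high: float) -> tuple[float, float]:
    """Return the (smallest, largest) total size in MB for a set of frames."""
    return (frames * frame_mb_low, frames * frame_mb_high)


# Figures from the slide: approx. 2,400 frames per set, each 1/4 to 1 MB.
low_mb, high_mb = set_size_range_mb(2400, 0.25, 1.0)

# Convert to GB under the stated 1 GB = 1000 MB assumption.
print(f"One raw data set: {low_mb / 1000:.1f} to {high_mb / 1000:.1f} GB")
```

Under these assumptions a single raw set falls between 0.6 and 2.4 GB, which is consistent with the "approx. 1 GB" figure and far larger than the CIF file (> 500 KB) that is most commonly shared.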
Findings
Distinguishing exchange from open sharing
exchange: sharing amongst collaborators is a primary concern, often with significant barriers
(more) open access: limited by need for control and reward system, but also
Sharing with wider “publics” is conditioned by both data management pressures and personal experience: the “known person – cost” algorithm; incidents of misuse
What is most easily or willingly shared is not always the data that has the most re-use value
Examples of what, and when

Field: Atmospheric science
  Specific research area: severe weather modeling
  Form to be shared: compressed output of the model
  Formats: Vis5D
  Type of data set: 1 file / dataset
  Size: 10-100 MB
  Shared when? 4-6 month embargo,

Field: Agronomy
  Specific research area: water quality, drainage, and plant growth
  Form to be shared: cleaned and reviewed sensor and hand-collected sample data
  Formats: .xls
  Type of data set: approx. 100 files
  Size: ~1 MB each, up to 20 MB
  Shared when? After publication

Field: Geology
  Specific research area: rock, water and microbes
  Form to be shared: averaged sensor and hand-collected sample data; photographs
  Formats: .xls; jpg
  Type of data set: 1 file; images
  Size: < 1 MB
  Shared when? After publication

Field: Civil Engineering
  Specific research area: traffic movement
  Form to be shared: cleaned and normalized sensor data
  Formats: MySQL (postgresql)
  Type of data set: 1 database
  Size: approx. 1000 K/day
  Shared when? 1 month to 1 year embargo
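The embargo periods in these examples (4-6 months, 1 month to 1 year) imply that a repository has to track a release date for each deposit. A minimal sketch of that check follows, assuming embargoes are counted in whole calendar months from a deposit date. The function names and sample dates are hypothetical and not drawn from the project.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    If the target month is shorter, the day is clamped to its last day.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def is_released(deposited: date, embargo_months: int, today: date) -> bool:
    """True once the embargo period has fully elapsed."""
    return today >= add_months(deposited, embargo_months)


# Hypothetical deposit with a 6-month embargo.
deposited = date(2008, 1, 31)
print(add_months(deposited, 6))
print(is_released(deposited, 6, today=date(2008, 7, 30)))
```

A real embargo service would also need to handle the "after publication" cases in the examples, where the release is triggered by an event and not by a fixed interval.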
Implications for Institutional Repositories
embargo services are a *must* (~66%, 14/20)
clear, explicit data citation information in IR records
disconnect: application of metadata standards highly important, but many unaware of existing standards
preservation services are needed to support re-use: 11/19 participants said their data would be useful for more than 10 years.
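The participant counts quoted above can be turned into percentages directly. This is a small illustrative calculation using only the counts given on the slide (14/20 and 11/19); the helper name is hypothetical. Note that 14 of 20 works out to 70%, so the "~66%" on the slide may be a loose rounding or may rest on a different denominator, which the slide does not state.

```python
def percent(count: int, total: int) -> float:
    """Share of `count` in `total`, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


embargo_share = percent(14, 20)     # participants needing embargo services
long_term_share = percent(11, 19)   # participants expecting >10 years of usefulness

print(f"Embargo services needed: {embargo_share}%")
print(f"Data useful for more than 10 years: {long_term_share}%")
```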
Supporting the science process
data exchange infrastructure
support for data management planning
data literacy instruction - integral to scientific information work
Broader implications for academic institutions
Leadership Opportunities for Libraries