Curating and Managing Research Data for Re-Use
Appraisal & Acquisition
Jared Lyle
We Are Here: Appraisal & Acquisition
Appraisal & Acquisition
• Collection development policy
• Appraisal
• Selection
• Acquisition
http://en.wikipedia.org/wiki/File:Schnorr_von_Carolsfeld_-_Die_Schlacht_Rudolfs_von_Habsburg_gegen_Ottokar_von_B%C3%B6hmen.jpg
Collection Development Policy
Identifies:
• Archive’s user base
• Types of data in which archive is interested
• Criteria used to determine archival value
Types of Data of Special Interest to ICPSR
• Diversity Data. Data that fosters understanding of the experiences of racial and ethnic minorities and other marginalized peoples living in the United States.
• Complex Data. Data arising from longitudinal research, survey research, and non-standard types: biological data, administrative records, video data, spatial data, remotely sensed data, and relational databases.
http://www.icpsr.umich.edu/icpsrweb/ICPSR/curation/selection.jsp
Types of Data of Special Interest to ICPSR
• Mixed Method Data. Data that can support both qualitative and quantitative analyses; data resulting from concurrent (both at the same time), sequential (one following the other), or conversion (one method to the other) mixed method study designs.
• Interdisciplinary Data. Data from interdisciplinary studies, and data resulting from studies using the research methods of multiple disciplines.
• International Data. Data originating outside the United States and data that support cross-national, comparative research. We are especially interested in data from countries and regions of the world that do not have a national structure for archiving, disseminating, and preserving research data.
http://www.icpsr.umich.edu/icpsrweb/ICPSR/curation/selection.jsp
ICPSR Criteria for Archival Value
• Nationally representative
• Theoretically/methodologically unique
• Representing underrepresented research populations
• Widely cited, appearing in top tier journals, or collected by an eminent scholar
Example Policies
• ICPSR Collection Development Policy http://www.icpsr.umich.edu/icpsrweb/ICPSR/org/policies/colldev.jsp
• UK Data Archive Collections Development Policy http://www.esds.ac.uk/news/publications/UKDACollectionsDevPolicy.pdf
• MSU Libraries Collection Development Policy Statement: Data Services http://libguides.lib.msu.edu/dataservicescollectiondevpolicy
A collection development policy is a living document and should be updated over time to follow the trends and output of the research community.
Example: Twitter @LOC
http://www.niemanlab.org/2012/07/that-plan-to-archive-every-tweet-in-the-library-of-congress-definitely-still-happening/
“Every public tweet, ever, since Twitter’s inception in March 2006, will be archived digitally at the Library of Congress.”
http://blogs.loc.gov/loc/2010/04/how-tweet-it-is-library-acquires-entire-twitter-archive/
2010 = 50 million tweets per day
2012 = 400 million tweets per day
http://www.niemanlab.org/2012/07/that-plan-to-archive-every-tweet-in-the-library-of-congress-definitely-still-happening/
“It’s critical the future generations know what flavor burrito I had for lunch.”
-first comment on the Library’s project FAQ page
http://www.niemanlab.org/2012/07/that-plan-to-archive-every-tweet-in-the-library-of-congress-definitely-still-happening/
http://www.nydailynews.com/news/national/shooting-dark-knight-rises-aurora-colorado-unfolds-real-time-twitter-witnesses-theatergoers-social-media-theater-massacre-article-1.1118345
“Research requests [of the Twitter archive] have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language.”
https://www.conftool.net/or2012/index.php?page=browseSessions&form_session=2
“…if you’re looking for a place where important historical and other information in digital form should be preserved for the long haul, we’re it!”
http://blogs.loc.gov/loc/2010/04/how-tweet-it-is-library-acquires-entire-twitter-archive/
http://www.loc.gov/acq/devpol/
Example: Twitter @ICPSR
http://wewillraakyou.com/2010/09/under-twitters-hood/
“We estimate the entire raw data will be about 15 TB and after processing and extraction, it may be less than 2~5 TB. Since we do not know at this point how large the data will be, it would be helpful if you can let us know ICPSR's upper bound on manageable data size so that we can quote that in the supplementary material for our initial proposal. Thank you.”
Example: Transactional Data @ICPSR
http://blog.mipimworld.com/wp-content/uploads/2012/01/Target-checkouts.jpg
Discussion
• What data are you pursuing?
• What do you do if you are offered things you don’t want?
• What new forms of data do you anticipate working with in the next year?
• How will that affect your collection development strategy and policy?
Possible New Forms of Data
• Continuous location information from cell phones or Fastlane transponders.
• Product radio-frequency identification (RFIDs), online product searches and purchases, and device fingerprinting.
• Electronic medical records, and new devices for continuous monitoring, passive heart beat measurement, movement indicators, skin conductivity.
• Satellite imagery.
• Social everything: networking, bookmarking, highlighting, commenting, product reviewing, recommending, and annotating.
• Online games and virtual worlds.
Gary King (http://www.sciencemag.org/content/331/6018/719.full)
Selection
• Passive
• Active
• Serendipitous
Source: Pienta, Gutmann, & Lyle, 2009
Why are data not shared?
• Preparing data and documentation can be enormously time consuming
• Limited resources for data preparation
• Need to protect the confidentiality of respondents
• Fear of getting “scooped”
• Lack of rewards for sharing
Source: Pienta, Gutmann, & Lyle, 2009
Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data”. http://hdl.handle.net/2027.42/78307
Pienta, Gutmann & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?” Presentation at the Research Conference on Research Integrity, Niagara Falls, NY
More about data sharing:
Pienta, Gutmann, Hoelter, Lyle, and Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.”
http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
Example: Selection @ICPSR
Discussion
• What are you doing to actively build your collection?
• How are you creating serendipitous selection?
Acquisition Goals
Transfer:
• Content
• Metadata
• Legal permissions
Legal permissions
• Do they have authority to deposit the content with you?
• Can you then modify, reformat, preserve, describe, and redisseminate?
• Are there any human disclosure issues?
ICPSR’s Deposit Agreement
• I have implicit or explicit copyright to this work and have the right to make it publicly available through ICPSR.
[red highlights added by me]
ICPSR’s Deposit Agreement
• I give my permission for the Data Collection to be used by ICPSR for the following purposes, without limitation:
– To redisseminate copies of the Data Collection in a variety of media formats
– To promote and advertise the Data Collection in any publicity (in any form) for ICPSR
– To describe, catalog, validate and document the Data Collection
– To store, translate, copy or re-format the Data Collection in any way to ensure its future preservation and accessibility
– To incorporate metadata or documentation in the Data Collection into public access catalogues
• I give my permission to ICPSR to enhance, transform and/or rearrange the Data Collection, including the data and metadata, for any of the following purposes:
– Protect respondent confidentiality
– Improve usability
[red highlights added by me]
ICPSR’s Deposit Agreement
• To the extent allowable by law or permitted by the sponsor of the data collection, in preparing this data collection for public archiving and distribution, I have removed all information directly identifying the research subjects in these data, and I have used due diligence in preventing information in the collection from being used to disclose the identity of research subjects.
• I further agree to release and hold harmless ICPSR (including staff and the ICPSR Council) and the University of Michigan from any and all liability from claims arising out of any legal action concerning identification of research subjects, breaches of confidentiality, or invasions of privacy by or on behalf of said subjects.
[red highlights added by me]
Deposit Mechanism
http://www.gizmodo.com.au/2010/08/rube-goldberg-the-man-behind-the-machines/
Deposit Mechanism
Atari’s “Star Trek” instructions:
Insert Quarter. Avoid Klingons.
-See Isaacson’s Steve Jobs
Example: Deposit Form @ICPSR
Pre-2007
2007
2010 Mock-up
Example: Deposit Form @DeepBlue
Example: Deposit Form @Dryad
http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/deposit/index.jsp
2012
Deposit: Behind the Scenes
• Checksum
• ID verification
• File type verification
• Data transferred to secure storage
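Two of the behind-the-scenes checks, the checksum and the file type verification, can be sketched in a few lines of Python. This is a minimal illustration, not ICPSR's actual deposit system: the function names, the choice of SHA-256, and the `ACCEPTED_EXTENSIONS` allow-list are all assumptions made for the example.

```python
import hashlib
from pathlib import Path

# Hypothetical allow-list of file extensions; a real archive would define its own.
ACCEPTED_EXTENSIONS = {".csv", ".txt", ".pdf", ".sav", ".dta", ".xml"}


def sha256_checksum(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_deposit(path: Path, expected_checksum: str) -> dict:
    """Check one deposited file against the depositor's checksum and the allow-list."""
    return {
        "checksum_ok": sha256_checksum(path) == expected_checksum.lower(),
        "file_type_ok": path.suffix.lower() in ACCEPTED_EXTENSIONS,
    }
```

Comparing the checksum computed on arrival with the one the depositor supplied shows whether the file changed in transit. The extension test is only a first pass; a fuller check would also inspect the file contents.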
Discussion
• How easy or complex should your deposit process be?
• What are incentives you can use to encourage depositors to do a thorough job?
• What legal issues do you address at deposit?
Other issues: Formats
DSpace @ MIT:
• Supported: DSpace fully supports the format.
• Known: DSpace can recognize the format, but we cannot guarantee full support.
• Unsupported: DSpace cannot recognize a format; such formats are listed as "application/octet-stream", or Unknown.
http://libraries.mit.edu/dspace-mit/build/policies/format.html
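The three support levels in the DSpace @ MIT policy can be modeled as a lookup with a fallback. The sketch below is illustrative only: the `FORMAT_REGISTRY` entries and the level assigned to each format are hypothetical, not MIT's actual format list. Only the "application/octet-stream" fallback for unrecognized formats comes from the policy text.

```python
from enum import Enum
from pathlib import PurePath


class SupportLevel(Enum):
    SUPPORTED = "supported"      # archive fully supports the format
    KNOWN = "known"              # recognized, but full support not guaranteed
    UNSUPPORTED = "unsupported"  # not recognized


# Hypothetical registry for illustration; the levels assigned here are assumptions.
FORMAT_REGISTRY = {
    ".txt": ("text/plain", SupportLevel.SUPPORTED),
    ".pdf": ("application/pdf", SupportLevel.SUPPORTED),
    ".doc": ("application/msword", SupportLevel.KNOWN),
}


def classify(filename: str) -> tuple[str, SupportLevel]:
    """Return (MIME type, support level); unrecognized formats fall back to octet-stream."""
    suffix = PurePath(filename).suffix.lower()
    return FORMAT_REGISTRY.get(
        suffix, ("application/octet-stream", SupportLevel.UNSUPPORTED)
    )
```

Under these assumptions, `classify("codebook.pdf")` returns the Supported level, while a file with an unlisted extension is reported as "application/octet-stream" and Unsupported.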
http://techaticpsr.blogspot.com/2012/05/april-2012-deposits-at-icpsr.html
File Types Deposited @ICPSR - April 2012
Other issues: Length of Commitment
How long (and at what level) do we commit to preserving data?
• Forever?
• 10 years?
• 5 years after the last access?
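To compare the three commitment options, it can help to compute when each would trigger a retention review for a given deposit. The sketch below is one possible reading, not a policy from the slides: the policy names are hypothetical, and it assumes the 10-year option is counted from the deposit date.

```python
from datetime import date
from typing import Optional


def add_years(start: date, years: int) -> date:
    """Add whole years to a date, mapping Feb 29 to Feb 28 in non-leap years."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def review_date(policy: str, deposited: date, last_access: date) -> Optional[date]:
    """Return the date a retention review falls due, or None for an open-ended commitment."""
    if policy == "forever":
        return None
    if policy == "10_years":  # assumption: counted from the deposit date
        return add_years(deposited, 10)
    if policy == "5_years_after_last_access":
        return add_years(last_access, 5)
    raise ValueError(f"unknown policy: {policy}")
```

The last-access option depends on usage logs being kept, since each new access pushes the review date forward.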
We Are Here: Appraisal & Acquisition