curating and managing research data for re-use review & processing jared lyle
TRANSCRIPT
Curating and Managing Research Data for Re-Use
Review & ProcessingJared Lyle
We Are Here Today: Review & Processing
http://weknowmemes.com/2011/12/this-is-my-room-what-i-think-it-looks-like-what-my-mom-thinks-it-looks-like/
A well-prepared data collection “contains information intended to be complete and self-explanatory” for future users.
Do no harm.
http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
Review
• Documentation• Data• [Disclosure Review]
Is the data collection complete, accurate, and well-documented?
Documentation
http://dx.doi.org/10.3886/ICPSR31521.v1
Essential Descriptive Elements
• Basic front matter• Variable level details• Methodology
Documentation: Front Matter
Title
Principal Investigator(s)
http://dx.doi.org/10.3886/ICPSR31521.v1
Description
Documentation: Front Matter
Monitoring the Future: A Continuing Study of American Youth (12th-Grade Survey), 2009. Johnston, Lloyd D., Jerald G. Bachman, Patrick M. O'Malley, and John E. Schulenberg. Monitoring the Future: A Continuing Study of American Youth (12th-Grade Survey), 2009 [Computer file]. ICPSR28401-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2010-10-27. doi:10.3886/ICPSR28401.v1
Documentation: Variable-level Details
National Longitudinal Study of Adolescent Health (Add Health), 1994-1995 (National Longitudinal Study of Adolescent Health (Add Health), Wave I School Administrator Codebook. http://www.cpc.unc.edu/projects/addhealth/codebooks/wave1/index.html
Variable Name
Documentation: Variable-level Details
Variable Label
Documentation: Variable-level Details
Variable Type
Documentation: Variable-level Details
Question Text
Documentation: Variable-level Details
Values
Documentation: Variable-level Details
Value Labels
Documentation: Variable-level Details
Missing Data
Documentation: Variable-level Details
Summary Statistics
Documentation: Variable-level Details
Constructed Variables
Documentation: Variable-level Details
Documentation: Variable-level Details
Skip Patterns
Notes
Documentation: Variable-level Details(examples)
American National Election Study, 2008-2009 Panel Study Frequency codebook, version 20090903. http://electionstudies.org/studypages/2008_2009panel/anes2008_2009panel_fcodebook.txt
Documentation: Variable-level Details(examples)
Davis, James A., Tom W. Smith, and Peter V. Marsden. General Social Surveys, 1972-2008 [Cumulative File] [Computer file]. ICPSR25962-v2. Storrs, CT: Roper Center for Public Opinion Resarch, University of Connecticut/Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2010-02-08. doi:10.3886/ICPSR25962
Documentation: Variable-level Details(examples)
United States Department of Health and Human Services. Substance Abuse and Mental Health Services Administration. Office of Applied Studies. National Survey on Drug Use and Health, 2009 [Computer file]. ICPSR29621-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2010-11-16. doi:10.3886/ICPSR29621
Documentation: Variable-level Details(examples)
United States Department of Justice. Office of Justice Programs. Bureau of Justice Statistics. Capital Punishment in the United States, 1973-2008 [Computer file]. ICPSR27982-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2010-09-07. doi:10.3886/ICPSR27982
• Sample design: A description of how the cases that appear in the study were selected, including details about target populations, sampling frames, sample sizes, sampling errors, and sampling methods.
• Data collection procedures: The methods used to collect the data (e.g., telephone, mail, computer-assisted). Where applicable, this includes the exact instructions and protocols used by interviewers when they collected the data.
• Data processing: The activities and quality checks performed on the data collection to generate the final data products from the raw collected data. If files were merged , a full description of the process should be provided.
Documentation: Methodology
• Weighting: Where applicable, a description of the criteria for using weights in the analysis of a data collection, including how the weights were created, all weighting formulae or coefficients, a definition of their elements, and an indication of how the formulae are applied to the data.
• Confidentiality issues: Where applicable, a discussion of any confidentiality issues in the data, as well as the steps taken to mitigate disclosure risk.
Documentation: Methodology
Other Documentation
• Questionnaire• User Guide• Handbook• Manual• Report• Table• User Agreement• Errata
Useful Resources: DescriptionICPSR, “What is a codebook?” http://www.icpsr.umich.edu/icpsrweb/ICPSR/support/faqs/2006/01/what-is-codebook
Institute for Health and Care Research Quality Handbook http://www.emgo.nl/kc/preparation/data%20collection/3%20Codebook.html Princeton University Data and Statistical Services, “How to Use a Codebook” http://dss.princeton.edu/online_help/analysis/codebook.htm UCLA Social Science Data Archive, “Codebooks” http://dataarchives.ss.ucla.edu/tutor/tutcode.htm
Data
Data Labels
• Does each variable have a variable name and label?
• Do all categorical variables have value labels?• Are labels consistent?
Naming Conventions: Variables
Variable Names:
•One-up numbers (V1, V2)•Question numbers (Q1, Q2)•Mnemonic names (age, race)•Prefix, root, suffix systems (FAED, MOED)
Naming Conventions: Variables
Variable Labels:
•Item/Question number•Indicate variable content•Indicate if variable constructed
Q14: Assessment of R’s Health
Naming Conventions: Values
Value Labels:
•Mutually exclusive, exhaustive, and defined•Preserve original information•Retain original coding scheme
Respondent’s Employment StatusSelf-employed (1)Somewhere-else (2) No answer (9)Not applicable (BK)
Missing Data
• Are there missing data?• Are missing data labeled?
77 = Inapplicable88 = Don’t Know99 = No Answer
Values
• Are the values reasonable (for example, date variables contain dates, gender variables don't have 10 categories, variables aren't all system missing)?
• Are there weight variables? If so, are they well documented?
Matching Data & Documentation
• Do the data match the documentation? Are values and/or labels listed in one but not in the other?
• Are all codes in the data valid (documented) according to the data collection instrument or PI's codebook?
• Are there duplicate records?• Does the spelling look OK?
Processing History
Useful Resources: DataUK Data Archive, “Documenting Your Data/Data Level/Structured Tabular Data” http://www.data-archive.ac.uk/create-manage/document/data-level?index=1
ICPSR Guide to Social Science Data Preparation and Archiving: Phase 3: Data Collection and File Creation, “Documenting Your Data/Data Level/Structured Tabular Data” http://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/chapter3quant.html
Activity
• Review the following data output and report any issues you find.
Examples of What to Look For:
42
43
Examples of What to Look For:
44
Examples of What to Look For:
45
Examples of What to Look For:
46
Examples of What to Look For:
47
Examples of What to Look For:
[Disclosure Review]
Discussion• How much cleaning do you do to a data
collection?• When is it appropriate to change the ‘original
order’ of a data collection?• How many processing details do you include
in the study documentation?
Example: Review @ICPSR
A well-prepared data collection “contains information intended to be complete and self-explanatory” for future users.
Do no harm.
We Are Here Today: Review & Processing