Data Archiving and Processing
Karine Sahakyan, MD, MPH
American University of Armenia
June 26-27, 2008
Caucasus Research Resource Centers – Armenia
A Program of the Eurasia Foundation
Overview
• Introduction, general archiving theory and practices
• Data structures and data processing
• Survey documentation
• User guides
Course schedule
26th June
• Introduction and orientation
• General archiving theory and practices
• Data processing, data file structures, deriving variables
• Practical exercises and coursework
Course schedule
27th June
• Data linkage
• Survey documentation
• Thematic user guides
Introduction
[Diagram, built up over several slides: Data, Project, Documentation, Theory, User]
Successive slides label the diagram with:
• Data storage and preservation (official archives)
• Data processing and analysis (personal archives)
• Supporting documentation
• User guides
Sources
• Elder, GH et al. (1993) Working with Archival Data: Studying Lives, London, Sage.
• Dale, A et al. (1988) Doing Secondary Analysis, London, Unwin Hyman.
• UK Data Archive (www.data-archive.ac.uk)
• Royal Statistical Society (www.rss.org.uk)
• ESDS (www.esds.ac.uk)
• SPSS to process and analyse data (www.spsstools.net/SampleSyntax.htm)
• DI2007 and INTAS 2007 survey data sets and documentation
Intended course outcomes
• 1: Understanding of the need to archive and document data systematically
• 2: Ability to distinguish between different data types and structures
• 3: Ability to use SPSS syntax files to perform data derivation, data merging and analysis
• 4: Appreciation of the benefits of user guides for prospective data users
Assessment
• User guide and codebook development for a specific thematic module of the DI2007. Tests ILOs 1, 2 and 4.
• Analysis of the DI2007 or INTAS survey using SPSS syntax files to derive variables, merge files and produce tables/statistics. Tests ILO 3.
Assessment
• Coursework sessions are timetabled, so you are expected to start both projects during this course
• Projects should be submitted 2 weeks after the end of the course
• I will be available by email to assist with any questions in this 2-week period
Data Archives
Data preservation: why?
Scientific responsibility
Costs
Legal requirement
Future use (secondary analysis)
Data preservation: what?
Original digitised data
Questionnaire forms (?)
Explanatory documentation (purpose and technical)
Unique identifiers (for future linkage and follow up)
Data at risk of being lost
Data preservation: how?
Design surveys with preservation in mind (consent forms, anonymisation)
Use commonly used formats (e.g. SPSS)
Collate developmental reports (track changes)
Recognised archive sites (CRRC!)
Data preservation: threats
Initial user needs delay access
Ownership and copyright
Confidentiality, disclosure, ethics, data protection
Physical storage media
Logical (digital) storage format
Costs
Organisational change
Poor data infrastructure (funding and strategy)
Survey Data: ‘version’ control
Early (pre-cleaning) release
‘Final’ release
Additional variables (derivations)
Preserving the original codings through:
1. using syntax to process the original data
2. saving processed data under a different file name
3. creating an archive of derived data sets (possibly thematic)
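The three steps above can be sketched in Python (the course itself uses SPSS syntax, which is not reproduced here). This is only an illustration under stated assumptions: the file names, the variable `v1` and the missing code -9 are hypothetical and are not taken from the DI2007 or INTAS files.

```python
import csv
import os
import tempfile

def derive_file(original_path, derived_path):
    """Read the original file, recode in memory, and save under a new name."""
    if os.path.abspath(original_path) == os.path.abspath(derived_path):
        raise ValueError("refusing to overwrite the original data file")
    with open(original_path, newline="") as src:
        rows = list(csv.DictReader(src))
    for row in rows:
        # Hypothetical recode: treat the code -9 as missing; v1 itself is kept.
        row["v1_clean"] = "" if row["v1"] == "-9" else row["v1"]
    with open(derived_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

# Demonstration with a throwaway file standing in for the original release.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "survey_original.csv")
with open(original, "w", newline="") as f:
    f.write("id,v1\n1,3\n2,-9\n")
with open(original) as f:
    before = f.read()

derived = derive_file(original, os.path.join(workdir, "survey_derived_v1.csv"))

with open(original) as f:
    after = f.read()   # identical to `before`: the original is never modified
```

Because the processing lives in a script rather than in manual edits, the derived file can be rebuilt from the original at any time.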
Exercise
1. What factors constitute the major threats to data preservation in the South Caucasus?
2. Using your list of threats, formulate a ‘best practice’ guide for the preservation of data which aims to safeguard the future of statistical data in the region.
Data file structures
Simple one-off cross-sectional
1. Simplest file structure
2. Data arranged in a case/variables matrix
3. Each case has a value on each variable
4. Each case has a unique identifier
5. Processing involves
   1. Selecting sub-sets of cases
   2. Selecting sub-sets of variables
   3. Recoding original variables
   4. Deriving new variables from existing ones
ID V1 V2 V3 V4 .. ..
1 1 3 70 6 .. ..
2 1 3 75 3 .. ..
3 2 2 73 4 .. ..
4 2 1 73 2 .. ..
5 2 1 72 6 .. ..
6 1 2 74 10 .. ..
7 1 3 73 2 .. ..
8 .. .. .. .. .. ..
.. .. .. .. .. .. ..
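The four processing operations listed for a case/variable matrix can be sketched in Python as follows. The rows are copied from the example matrix; the selection rule, the cut-off used for recoding and the derived sum are assumptions made purely for illustration.

```python
# Rows copied from the example matrix; each dict is one case.
cases = [
    {"ID": 1, "V1": 1, "V2": 3, "V3": 70, "V4": 6},
    {"ID": 2, "V1": 1, "V2": 3, "V3": 75, "V4": 3},
    {"ID": 3, "V1": 2, "V2": 2, "V3": 73, "V4": 4},
    {"ID": 4, "V1": 2, "V2": 1, "V3": 73, "V4": 2},
    {"ID": 5, "V1": 2, "V2": 1, "V3": 72, "V4": 6},
    {"ID": 6, "V1": 1, "V2": 2, "V3": 74, "V4": 10},
    {"ID": 7, "V1": 1, "V2": 3, "V3": 73, "V4": 2},
]

# 1. Select a sub-set of cases (assumed rule: cases with V1 equal to 1).
subset_cases = [c for c in cases if c["V1"] == 1]

# 2. Select a sub-set of variables (keep the identifier plus V3 and V4).
subset_vars = [{k: c[k] for k in ("ID", "V3", "V4")} for c in cases]

# 3. Recode an original variable into a new one (assumed cut-off: V4 >= 5).
# 4. Derive a new variable from existing ones (assumed rule: V3 + V4).
processed = [
    {**c, "V4_HIGH": 1 if c["V4"] >= 5 else 0, "V3_PLUS_V4": c["V3"] + c["V4"]}
    for c in cases
]
```

Each step builds a new list, so the original `cases` stay untouched.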
Repeated cross-sectional
1. As above, but the questionnaire, or a newer version of it, is administered at different points in time (say annually)
2. Respondents are sampled anew each time
3. Data processing as above
4. Comparisons over time are macro, not micro, i.e. they represent aggregate change over time and not individual change.
Different respondents, same questions
T1
ID V1 V2 V3 ..
1 1 3 72 ..
2 1 3 71 ..
3 2 1 74 ..
.. .. .. .. ..
T2
ID V1 V2 V3 ..
1 2 2 79 ..
2 2 1 80 ..
3 1 3 83 ..
.. .. .. .. ..
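A small Python sketch of what a macro comparison means in practice. The V3 values are copied from the T1 and T2 example tables; comparing means is just one possible summary, chosen here as an assumption.

```python
from statistics import mean

# V3 values copied from the two example tables (T1 and T2).
t1 = [{"ID": 1, "V3": 72}, {"ID": 2, "V3": 71}, {"ID": 3, "V3": 74}]
t2 = [{"ID": 1, "V3": 79}, {"ID": 2, "V3": 80}, {"ID": 3, "V3": 83}]

# ID 1 at T1 and ID 1 at T2 are different people, so rows must NOT be
# matched on ID. Only summaries of each sample can be compared.
mean_t1 = mean(r["V3"] for r in t1)
mean_t2 = mean(r["V3"] for r in t2)
aggregate_change = mean_t2 - mean_t1
```

The result describes change in the sample average between the two time points, not change for any individual respondent.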
Hierarchical cross-sectional
1. Similar to the above, but now there is more than one structure in the data, e.g. respondents within households.
2. The case/variable matrix is now ‘nested’, i.e. some data is for the household (HH) and some for the individual (this can be in the same data file or in separate files)
3. Separate unique code numbers are needed for each level.
4. Data processing involves
   1. Accurate separation of different levels
   2. Suitable linkage where appropriate
Hierarchical structure #1 (people in households)
HHID H1 H2 PNO P1 P2 .. ..
1 4 1 1 2 1 .. ..
1 4 1 2 1 2 .. ..
2 1 1 1 2 2 .. ..
3 2 2 1 2 2 .. ..
3 2 2 2 1 1 .. ..
3 2 2 3 1 2 .. ..
4 1 1 1 1 2 .. ..
4 1 1 2 2 1 .. ..
.. .. .. .. .. .. .. ..
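Separating the two levels of the people-in-households table can be sketched in Python like this. The rows are copied from the table; treating H1 and H2 as household-level variables and P1 and P2 as person-level variables is an assumption based on the column names.

```python
# Rows copied from hierarchical structure #1: (HHID, H1, H2, PNO, P1, P2).
flat = [
    (1, 4, 1, 1, 2, 1), (1, 4, 1, 2, 1, 2),
    (2, 1, 1, 1, 2, 2),
    (3, 2, 2, 1, 2, 2), (3, 2, 2, 2, 1, 1), (3, 2, 2, 3, 1, 2),
    (4, 1, 1, 1, 1, 2), (4, 1, 1, 2, 2, 1),
]

households = {}   # one record per household, keyed by HHID
persons = {}      # one record per person, keyed by (HHID, PNO)

for hhid, h1, h2, pno, p1, p2 in flat:
    hh_record = {"H1": h1, "H2": h2}
    # Household variables are repeated on every person row; check they agree.
    if households.setdefault(hhid, hh_record) != hh_record:
        raise ValueError(f"inconsistent household data for HHID {hhid}")
    # PNO restarts in each household, so only the pair (HHID, PNO) is unique.
    persons[(hhid, pno)] = {"P1": p1, "P2": p2}
```

Counting households from the flat file directly would count each household once per member, which is why the levels are split before analysis.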
Hierarchical structure #2 (episodes of employment)
PID EMPNO P1 P2 P3 ..
1 1 2 2001 2005 ..
1 2 1 2005 -8 ..
2 1 2 2002 -8 ..
3 1 2 1995 1998 ..
3 2 1 1998 2001 ..
3 3 1 2001 -8 ..
4 1 1 2004 2005 ..
4 2 2 2005 -8 ..
.. .. .. .. .. ..
Panel
1. Uses the same sample of respondents over time
2. Questions are often repeated at successive waves
3. Data structure can be a simple case/variable matrix for each phase of data collection
4. Each respondent needs a unique identifier that stays the same for the life of the panel.
5. Data processing
   1. Connecting variables for a single individual over successive waves of the survey (micro data analysis)
Same respondents over time: T1, T2, T3, T4
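Connecting one individual's records across waves might look like this in Python. All values, the variable `income` and the respondent IDs are hypothetical; only two waves are shown to keep the sketch short.

```python
# Hypothetical panel data: one case/variable file per wave, keyed by a
# respondent ID that stays the same for the life of the panel.
wave1 = {101: {"income": 200}, 102: {"income": 150}, 103: {"income": 300}}
wave2 = {101: {"income": 220}, 102: {"income": 150}, 104: {"income": 90}}

# Connect each individual's records across waves (respondents present in both).
linked = {
    pid: {"income_t1": wave1[pid]["income"], "income_t2": wave2[pid]["income"]}
    for pid in sorted(wave1.keys() & wave2.keys())
}

# Micro-level analysis: change for each individual, not just the average.
change = {pid: rec["income_t2"] - rec["income_t1"] for pid, rec in linked.items()}
```

Respondents 103 and 104 appear in only one wave and drop out of the linked file; how to treat such cases is a separate design decision that this sketch does not address.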
Cohort
1. Similar to Panel but each case is from a common cohort (where this is taken to be time related)
2. Birth cohort studies for example – all babies born in a particular week during a particular year, traced through their lives
3. Data structure and processing same as panel
Same respondents over time: T1, T2, T3, T4
Retrospective
1. Not really a survey type but a data collection tool
2. Can be included in any of the surveys listed above
3. Data is (retrospectively) longitudinal
4. Each retrospective element needs to have unique codings for different events or episodes
5. Data structure is ‘relational’: each element relates to the respondent as well as to the respondent’s other retrospective elements
6. Data processing
   1. Time-sensitive linkages of different elements
Looking backwards: from T1 to T1 - X
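A time-sensitive linkage between two retrospective elements can be sketched in Python as below. The data are hypothetical, and two assumptions are made loudly: an end year of -8 is read as "episode still ongoing", and a job that ends in a given year is treated as already over in that year.

```python
# Hypothetical retrospective modules, both keyed by person ID (PID).
# Assumption: an end year of -8 means the episode was ongoing at interview.
ONGOING = -8
jobs = [
    {"PID": 1, "EMPNO": 1, "start": 2001, "end": 2005},
    {"PID": 1, "EMPNO": 2, "start": 2005, "end": ONGOING},
    {"PID": 3, "EMPNO": 1, "start": 1995, "end": 1998},
    {"PID": 3, "EMPNO": 2, "start": 1998, "end": 2001},
    {"PID": 3, "EMPNO": 3, "start": 2001, "end": ONGOING},
]
marriages = [{"PID": 1, "year": 2003}, {"PID": 3, "year": 1999}]

def job_at(pid, year):
    """Return the EMPNO of the job that person `pid` held in `year`, or None."""
    for job in jobs:
        if job["PID"] != pid:
            continue
        # Assumption: a job ending in `year` counts as ended in that year.
        ended = job["end"] != ONGOING and job["end"] <= year
        if job["start"] <= year and not ended:
            return job["EMPNO"]
    return None

# Time-sensitive link: attach the job held in the year of marriage.
job_at_marriage = {m["PID"]: job_at(m["PID"], m["year"]) for m in marriages}
```

The link depends on dates as well as on the person ID, which is what distinguishes it from a simple one-to-one merge.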
Exercise
1. What type of survey design can help with the following ideas:
   1. Young people are taking longer to get married than they used to
   2. Fear of crime is highest in the urban environment
   3. Employment and income are generally under-reported
   4. The Democrats will win the US presidency in 2008
Structure of DI2007
1. Cross-sectional
2. Individuals within HH
3. All HH members
4. Absent HH migrants
Household ID links each file:
• HH and Ind
• HH members
• Absent Migrants
Structure of INTAS 2007
• Linked to DI 2005, so panel and hierarchical (though these properties are not being used)
• Retrospective data collection
• Main file and 8 retrospective modules
• Relational structure
Each data file is related to the others through the person ID:
Education
Marriage
Leisure
Employment
Job
Housing
Cohabitation
Core
Children
Deriving variables
Coding and recoding
1. Original codings (as in code book)
2. Simplifications
   1. Dealing with DK / NA and Missing codes (tidying up)
   2. Collapsing categories (substantive and statistical)
3. Improves analysis and presentation
4. See D1.sps – DI2007 and contributions of absent HH members
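The two simplifications can be sketched in Python as follows. This is not the content of D1.sps: the codes, the five-point scale and the three collapsed categories are hypothetical and do not come from the DI2007 code book.

```python
# Hypothetical original codings for one question (not the DI2007 code book):
# 1-5 = strongly disagree ... strongly agree, -1 = don't know, -2 = no answer.
MISSING_CODES = {-1, -2}
COLLAPSE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}   # disagree / neutral / agree

def recode(value):
    """Tidy up DK/NA codes to None, then collapse five categories into three."""
    if value in MISSING_CODES:
        return None
    return COLLAPSE[value]

original = [1, 5, -1, 3, 2, -2, 4]
recoded = [recode(v) for v in original]   # the original list is left unchanged
```

Writing the recoded values to a new variable keeps the original codings available for checking.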
Creating analytic files
1. Protects the original data from being deleted/overwritten
2. Small files are processed faster
3. Less scrolling through data/variables
4. If syntax file used, it is easy to adapt (to include or delete variables)
5. See D2.sps
Deriving variables
1. Combining variables to produce a hybrid
2. Can be a scale that summarises a concept, i.e. where all response codes are of the same type (the ‘safety’ example; see D3.sps)
3. Can relate to a broad conceptual category (social origins using parental education and employment. See CF1.sps)
4. To adjust data where you have reason to suspect that one variable might help to improve another (using reported expenditure to adjust reported income. See CF7.sps)
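A scale-type derived variable could be built along these lines in Python. This is not the content of D3.sps: the item names, the 1 to 4 coding and the rule of returning no score when any item is missing are all assumptions for illustration.

```python
# Hypothetical 'safety' items, all coded 1 (very unsafe) to 4 (very safe);
# None marks a missing answer. Item names are invented for this sketch.
ITEMS = ("safe_home", "safe_street", "safe_night")

def safety_scale(case):
    """Sum the items into one score; return None if any item is missing."""
    values = [case[item] for item in ITEMS]
    if any(v is None for v in values):
        return None
    return sum(values)

cases = [
    {"ID": 1, "safe_home": 4, "safe_street": 3, "safe_night": 2},
    {"ID": 2, "safe_home": 1, "safe_street": None, "safe_night": 1},
]
scores = {c["ID"]: safety_scale(c) for c in cases}
```

Summing only makes sense because every item uses the same response codes in the same direction.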
Data linkage
Hierarchical links
1. Data is nested: individual within household, or episodes belong to an individual.
2. Link 1: attach HH data to the individual (for analysing individuals; not needed for DI as already linked)
3. Link 2: produce summary data of all individuals in the HH, and attach it to the HH data (for analysing HHs). See D6.sps, though it is a bit long.
4. Link 3: attach episode data to an individual. See CF2.sps.
5. See earlier slides on hierarchical data.
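Link 1 and Link 2 can be sketched in Python as below. This is not the content of D6.sps: the two small files, the variables `region` and `age`, and the under-18 definition of a child are hypothetical.

```python
from collections import defaultdict

# Hypothetical separate files sharing a household ID.
households = {1: {"region": "A"}, 2: {"region": "B"}}
persons = [
    {"HHID": 1, "PNO": 1, "age": 40},
    {"HHID": 1, "PNO": 2, "age": 12},
    {"HHID": 2, "PNO": 1, "age": 67},
]

# Link 1: attach household data to each individual (unit of analysis: person).
person_file = [{**p, **households[p["HHID"]]} for p in persons]

# Link 2: summarise all individuals in a household and attach the summary
# to the household record (unit of analysis: household).
summary = defaultdict(lambda: {"n_members": 0, "n_children": 0})
for p in persons:
    summary[p["HHID"]]["n_members"] += 1
    summary[p["HHID"]]["n_children"] += 1 if p["age"] < 18 else 0
household_file = {
    hhid: {**hh, **summary[hhid]} for hhid, hh in households.items()
}
```

Link 1 copies data downwards and keeps one row per person; Link 2 aggregates upwards and keeps one row per household.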
Longitudinal links
1. The respondents’ data from successive surveys is joined together
2. Cross-wave ID number used for both individual and family
3. Panel surveys and cohort surveys
Relational links
1. Linking an individual’s marital statuses and fertility statuses
2. Linking an individual’s education / employment and job statuses
3. Linking both of the above
4. …adding housing and leisure
5. These are so-called ‘many to many’ links, i.e. one individual may have had 5 jobs, 4 different addresses, 2 marriages, 4 children and so on, while others might have had far fewer.
6. See CF5.sps and CF6.sps
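One way to handle many-to-many links is to reduce each module to one value per person before attaching it to the core file. The Python sketch below shows that idea only; it is not the content of CF5.sps or CF6.sps, and the module contents and the choice of a simple episode count are hypothetical.

```python
from collections import Counter

# Hypothetical core file and retrospective modules, all keyed by person ID.
core = {1: {"sex": "F"}, 2: {"sex": "M"}}
modules = {
    "jobs":      [{"PID": 1}, {"PID": 1}, {"PID": 1}, {"PID": 2}],
    "marriages": [{"PID": 1}, {"PID": 1}],
    "children":  [{"PID": 2}, {"PID": 2}],
}

# Each person can have any number of rows in each module, so first reduce
# every module to one summary value per person (here: a count of episodes).
linked = {pid: dict(record) for pid, record in core.items()}
for name, rows in modules.items():
    counts = Counter(row["PID"] for row in rows)
    for pid in linked:
        linked[pid]["n_" + name] = counts[pid]   # Counter returns 0 if absent
```

Joining the raw modules to each other directly would multiply rows (3 jobs times 2 marriages gives 6 rows for person 1), which is why a summary step comes first in this sketch.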
Data Processing Coursework
Survey Documentation
Survey Documentation exercise
1. In groups, detail exactly what is needed to begin analysing a survey data set effectively.
2. Try to be as precise as possible about the type of documentation, its content and the amount of detail that is required.
3. If you were to manage an archive of different social surveys, what would be on your check list of details to catalogue the surveys?
Survey Documentation (ESDS)
– all variables should be named. Ideally, variable names should not exceed 8 characters, which ensures compatibility between all current dissemination formats used by the ESDS. The absolute maximum is 32 characters, which ensures compatibility with recent versions of all major dissemination formats (SPSS, ver. 12 onwards; Stata, ver. 7 onwards; and SAS, ver. 8 onwards)
– all variables should be labelled. Labels should be brief (preferably < 80 characters) but precise, and should always make explicit the unit of measurement for continuous (interval) variables. Where possible, all variable labels should reference the question number (and if necessary questionnaire)
– where possible, all data labelling should be created and supplied to the ESDS as part of the data file itself. This is the expectation with data supplied in one of the three major statistical packages - SPSS, Stata or SAS
– if the package being used for data management does not allow such variable and code labelling it must be provided as part of the documentation - i.e. a comprehensive list of variable names, variable descriptions, code names and variable formatting information
– the code used to create all derived variables (e.g. the SPSS syntax file or Stata do file) should be provided so that interested users can see exactly how these variables have arisen
– See also ESDS ‘Research Management and Documentation’
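The naming and labelling limits quoted above lend themselves to an automatic check before a file is deposited. The Python sketch below is an illustration, not an ESDS tool: the function name, the warning texts and the example variables are hypothetical, and only the 8, 32 and 80 character limits come from the guidance.

```python
def check_variable(name, label):
    """Return a list of warnings for one variable name and its label."""
    warnings = []
    if not label:
        warnings.append("missing label")
    elif len(label) >= 80:
        warnings.append("label should be shorter than 80 characters")
    if len(name) > 32:
        warnings.append("name exceeds the absolute maximum of 32 characters")
    elif len(name) > 8:
        warnings.append("name longer than the recommended 8 characters")
    return warnings

# Hypothetical variable names and labels, for illustration only.
variables = {
    "hhinc": "Q12. Total household income last month (AMD)",
    "household_income": "",
}
report = {name: check_variable(name, label) for name, label in variables.items()}
```

An empty list means the variable passes both checks; the check does not verify that a label names its unit or question number, which still needs a human reader.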
Survey Documentation (Metadata)
1. Rationale / purpose / history
2. Questionnaires
3. Code book
4. Technical details (design issues, sampling, weighting, imputation etc)
5. Technical details for users (user based examples)
6. Publications (working papers / technical reports / academic papers)
Example
1. www.iser.essex.ac.uk/ulsc/bhps/doc/
User Guides
What is a thematic data user guide?
1. A tool to assist researchers in locating and using data, i.e. it is not meant to provide all answers, only to point to sources which contain more detail
2. An up-to-date collation of research/publications on a particular topic
3. A brief description of different data sources
4. A brief description of different research projects, their aims, and design.
5. A collation of different theoretical questions of relevance
6. See ESDS Education / Social Capital
Examples
1. www.esds.ac.uk/support/thematicguides.asp
Exercise
1. Select a substantive topic of interest to you
2. List current data sources / theories / publications
3. Draft an introduction to this topic that would help a researcher to quickly learn the main issues.